Loading…
RSMoDM: Multimodal Momentum Distillation Model for Remote Sensing Visual Question Answering
Remote sensing (RS) visual question answering (VQA) is a task that answers questions about a given RS image by utilizing both image and textual information. However, existing methods in RS VQA overlook the fact that the ground truths in RS VQA benchmark datasets, which are algorithmically generated...
Saved in:
Published in: | IEEE journal of selected topics in applied earth observations and remote sensing 2024, Vol.17, p.16799-16814 |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Remote sensing (RS) visual question answering (VQA) is a task that answers questions about a given RS image by utilizing both image and textual information. However, existing methods in RS VQA overlook the fact that the ground truths in RS VQA benchmark datasets, which are algorithmically generated rather than manually annotated, may not always represent the most reasonable answers to the questions. In this article, we propose a multimodal momentum distillation model (RSMoDM) for RS VQA tasks. Specifically, we maintain the momentum distillation model during the training stage that generates stable and reliable pseudolabels for additional supervision, effectively preventing the model from being penalized for producing other reasonable outputs that differ from ground truth. Additionally, to address domain shift in RS, we employ the Vision Transformer (ViT) trained on a large-scale RS dataset for enhanced image feature extraction. Moreover, we introduce the multimodal fusion module with cross-attention for improved cross-modal representation learning. Our extensive experiments across three different RS VQA datasets demonstrate that RSMoDM achieves state-of-the-art performance, particularly excelling in scenarios with limited training data. The strong interpretability of our method is further evidenced by visualized attention maps. |
---|---|
ISSN: | 1939-1404 2151-1535 |
DOI: | 10.1109/JSTARS.2024.3419035 |