Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, pp. 1-14
Main Authors: Li, Chongyang, Zhang, Wenkai, Bi, Hanbo, Li, Jihao, Li, Shuoke, Yu, Haichen, Sun, Xian, Wang, Hongqi
Format: Article
Language: English
Description
Summary: The remote sensing visual grounding (RSVG) task focuses on accurately identifying and localizing specific targets in remote sensing (RS) images using descriptive query expressions. Existing methods extract visual and textual features independently, ignoring early complementary information between image and text; this leads to information loss and misalignment, limiting the model's ability to distinguish similar targets. To address this challenge, we propose the query-aware multimodal fusion network (QAMFN), which introduces a query-guided visual attention (QGVA) mechanism in the early stages of the visual encoder. By injecting textual information into the early visual feature extraction process, QGVA resolves the issue of missing image-text complementary information and ensures that the visual backbone focuses on local features highly relevant to the query. Additionally, to enhance the model's ability to integrate multimodal information and adapt to more complex RS images, we introduce the text-semantic attention-guided masking (TAM) module, which aggregates the multimodal features produced by the backbones and filters out redundant information, yielding high-quality fused features. Experiments demonstrate that our approach sets a new record on the DIOR-RSVG dataset, improving accuracy to 81.67% (an absolute increase of 4.98%).
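
The abstract describes, at a high level, how QGVA conditions the visual backbone on the query text during early feature extraction. As a rough illustration of that general idea only (not the authors' implementation; the module name, dimensions, and cross-attention design below are all assumptions), an early-stage text-to-vision injection block might look like this in PyTorch:

import torch
import torch.nn as nn

class QueryGuidedVisualAttention(nn.Module):
    """Hypothetical sketch: visual tokens cross-attend to query text tokens,
    so early visual features become query-aware before late fusion."""

    def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        # Project text features into the visual embedding space.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # Visual tokens act as queries; text tokens supply keys/values.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_v, vis_dim); txt_tokens: (B, N_t, txt_dim)
        txt = self.txt_proj(txt_tokens)
        attended, _ = self.cross_attn(query=vis_tokens, key=txt, value=txt)
        # Residual connection preserves the original visual signal.
        return self.norm(vis_tokens + attended)

# Toy usage: 196 visual tokens, 20 text tokens (shapes are illustrative).
block = QueryGuidedVisualAttention(vis_dim=256, txt_dim=768)
vis = torch.randn(2, 196, 256)
txt = torch.randn(2, 20, 768)
out = block(vis, txt)  # (2, 196, 256), now text-conditioned

The design point the abstract emphasizes is that this conditioning happens inside the backbone's early stages, rather than after independent feature extraction, so the visual features handed to the later fusion stage (TAM, in the paper) are already aligned with the query.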
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3450303