Modality-Specific Multimodal Global Enhanced Network for Text-Based Visual Question Answering
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Subjects:
Summary: Text-based visual question answering (T-VQA) aims to answer questions about images by comprehending both detected objects and OCR (optical character recognition) tokens. Most existing methods fail to eliminate noisy and redundant detected objects and ignore modality-specific information. To address these concerns, we propose the multimodal global enhanced network (MGEN) for T-VQA. In MGEN, the multimodal global enhanced OCR graph focuses on modeling the spatial relationships between OCR tokens rather than between objects, which carry noise and redundancy. We then introduce the multimodal global enhanced transformer module, built from the proposed attention mechanism, to reflect the specificity of the various modalities. Both modules leverage global features, so the model's attention is directed to critical parts while noise is further reduced. Extensive experiments demonstrate the effectiveness and superiority of the proposed MGEN against state-of-the-art methods.
ISSN: 1945-788X
DOI: 10.1109/ICME52920.2022.9859865
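
The abstract describes attention that leverages a global feature so the model focuses on critical parts while noise is suppressed, but it gives no equations. The sketch below is a rough, generic illustration only, not the authors' formulation: every name, shape, and the mean-pooling choice are assumptions. It shows one common way a pooled global feature can be injected into scaled dot-product attention over OCR token features, by prepending the global vector as an extra key/value slot so each token also attends to a whole-context summary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_enhanced_attention(tokens, global_feat, W_q, W_k, W_v):
    """Scaled dot-product attention over token features with a pooled
    global feature prepended as one extra key/value slot.

    Illustrative sketch only; not the MGEN paper's actual module.

    tokens:        (n, d) per-token features, e.g. OCR token embeddings
    global_feat:   (d,)   pooled global feature vector
    W_q, W_k, W_v: (d, d) projection matrices
    """
    d = tokens.shape[1]
    kv_input = np.vstack([global_feat[None, :], tokens])  # (n + 1, d)
    q = tokens @ W_q                                      # (n, d)
    k = kv_input @ W_k                                    # (n + 1, d)
    v = kv_input @ W_v                                    # (n + 1, d)
    scores = (q @ k.T) / np.sqrt(d)                       # (n, n + 1)
    attn = softmax(scores, axis=-1)   # each token also attends to the global slot
    return attn @ v                                       # (n, d)

# Toy usage with random features; mean pooling stands in for the global feature.
rng = np.random.default_rng(0)
n, d = 5, 16
tokens = rng.normal(size=(n, d))
global_feat = tokens.mean(axis=0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = global_enhanced_attention(tokens, global_feat, W_q, W_k, W_v)
print(out.shape)  # (5, 16)
```

In this sketch, prepending the global vector (rather than adding it to every token) keeps the attention weights inspectable: the first column of `attn` shows how strongly each OCR token relies on the global summary versus its neighbors.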