Modality-Specific Multimodal Global Enhanced Network for Text-Based Visual Question Answering

Bibliographic Details
Main Authors: Yang, Zhi; Xuan, Jun; Liu, Qing; Mao, Aihua
Format: Conference Proceeding
Language: English
Description
Summary: Text-based visual question answering (T-VQA) aims to answer questions about images by comprehending both detected objects and OCR (optical character recognition) tokens. Most existing methods fail to eliminate noisy and redundant detected objects and ignore modality-specific information. To address these concerns, we propose the multimodal global enhanced network (MGEN) for T-VQA. In MGEN, the multimodal global enhanced OCR graph focuses on modeling the spatial relationships between OCR tokens rather than between objects, which carry noise and redundancy. We then introduce the multimodal global enhanced transformer module, built on the proposed attention mechanism, to reflect the specificity of the different modalities. Both modules leverage global features, so the model's attention is directed to critical parts while noise is further reduced. Extensive experiments demonstrate the effectiveness and superiority of the proposed MGEN against state-of-the-art methods.
ISSN: 1945-788X
DOI: 10.1109/ICME52920.2022.9859865
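
The record above contains only the abstract, so the paper's exact formulation of the global enhanced attention is not available here. Below is a minimal sketch, assuming a standard scaled dot-product attention over OCR-token features in which a pooled multimodal global feature is added to the query projection; the class name GlobalEnhancedAttention, the pooling choice, and all dimensions are hypothetical illustrations, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalEnhancedAttention(nn.Module):
    """Hypothetical sketch: attention over OCR-token features where a pooled
    global feature is added to the query, so the attention weights are
    conditioned on the overall image/question context."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.g_proj = nn.Linear(dim, dim)  # projects the global feature

    def forward(self, ocr_feats: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # ocr_feats: (batch, num_tokens, dim); global_feat: (batch, dim)
        q = self.q_proj(ocr_feats) + self.g_proj(global_feat).unsqueeze(1)
        k = self.k_proj(ocr_feats)
        v = self.v_proj(ocr_feats)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)
        return attn @ v  # globally enhanced OCR-token representations


# Usage: 32 OCR tokens with 768-d features; the mean-pooled feature stands in
# for the multimodal global feature (an assumption for this sketch).
feats = torch.randn(2, 32, 768)
g = feats.mean(dim=1)
out = GlobalEnhancedAttention(768)(feats, g)  # shape: (2, 32, 768)
```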