Knowing What it is: Semantic-Enhanced Dual Attention Transformer

Bibliographic Details
Published in:	IEEE transactions on multimedia 2023, Vol.25, p.3723-3736
Main Authors:	Ma, Yiwei; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Wu, Yongjian; Huang, Feiyue; Ji, Rongrong
Format:	Article
Language:	English
Subjects:	attention mechanism; Head; Image captioning; Integrated circuit modeling; Modules; Multimedia; Questions; Semantics; Task analysis; transformer; Transformers; visual question answering; Visual tasks; Visualization

Description
Summary:	Attention has become an indispensable component of models for various multimedia tasks such as Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed to capture spatial dependency and remain insufficient for semantic understanding, e.g., the categories of objects and their attributes, which is also critical for image captioning. To compensate for this defect, we propose a novel attention module termed Channel-wise Attention Block (CAB) that models channel-wise dependency for both the visual and linguistic modalities, thereby improving semantic learning and multi-modal reasoning simultaneously. Specifically, CAB has two novel designs to tackle the high overhead of channel-wise attention: the reduction-reconstruction block structure and the gating-based attention prediction. Based on CAB, we further propose a novel Semantic-enhanced Dual Attention Transformer (termed SDATR), which combines the merits of spatial and channel-wise attention. To validate SDATR, we conduct extensive experiments on the MS COCO dataset and achieve new state-of-the-art performance: a CIDEr score of 134.5 on the COCO Karpathy test split and 136.0 on the official online testing server. To examine the generalization of SDATR, we also apply it to the task of visual question answering, where performance gains are also observed. The code and models are publicly available at https://github.com/xmu-xiaoma666/SDATR.
ISSN:	1520-9210, 1941-0077
DOI:	10.1109/TMM.2022.3164787
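
Illustrative note: the summary names a reduction-reconstruction structure and gating-based attention prediction as the two designs that keep channel-wise attention cheap, but this record does not describe CAB's actual layers. The sketch below is therefore only a generic illustration of the idea, assuming a squeeze-and-excitation-style bottleneck: channel statistics are pooled over tokens, reduced to a smaller hidden size, reconstructed to the full channel count, and passed through a sigmoid to give one gate per channel. The function name `channel_gate`, the weight names, and the mean pooling are hypothetical choices, not taken from the paper or from the SDATR repository.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_gate(features, w_reduce, w_expand):
    """Hypothetical channel-wise gating with a reduce/reconstruct bottleneck.

    features: N tokens, each a list of C channel values.
    w_reduce: (C // r) x C matrix, the assumed "reduction" step.
    w_expand: C x (C // r) matrix, the assumed "reconstruction" step.
    Returns the gated features and the per-channel gates.
    """
    n_tokens = len(features)
    n_channels = len(features[0])
    # Pool over tokens: one mean value per channel.
    pooled = [sum(tok[j] for tok in features) / n_tokens
              for j in range(n_channels)]
    # Reduction to the bottleneck size, followed by ReLU.
    hidden = [max(0.0, h) for h in matvec(w_reduce, pooled)]
    # Reconstruction to C values, squashed into (0, 1) gates.
    gates = [sigmoid(g) for g in matvec(w_expand, hidden)]
    # Rescale every token channel by channel.
    gated = [[x * g for x, g in zip(tok, gates)] for tok in features]
    return gated, gates


if __name__ == "__main__":
    random.seed(0)
    C, R = 8, 4  # channel count and reduction ratio (illustrative values)
    feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(5)]
    w_r = [[random.gauss(0, 0.5) for _ in range(C)] for _ in range(C // R)]
    w_e = [[random.gauss(0, 0.5) for _ in range(C // R)] for _ in range(C)]
    _, gates = channel_gate(feats, w_r, w_e)
    print([round(g, 3) for g in gates])
```

The bottleneck needs about 2·C²/r weights and predicts C gates directly, whereas a full channel-to-channel affinity map would have C × C entries; that is the kind of saving the summary alludes to, though the paper's own formulation may differ.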