CLVIN: Complete language-vision interaction network for visual question answering

Bibliographic Details
Published in: Knowledge-Based Systems, 2023-09, Vol. 275, p. 110706, Article 110706
Main Authors: Chen, Chongqing; Han, Dezhi; Shen, Xiang
Format: Article
Language: English
Description
Summary: The Transformer has improved the interactive modeling of multimodal information in visual question answering (VQA), helping machines better understand such information. Existing Transformer-based end-to-end methods have made progress either by applying the Encoder-Decoder (E-D) mode or by realizing complete interaction, but almost none combines the advantages of both and exploits them fully. This paper therefore designs a complete language-vision interaction network (CLVIN) for VQA that implements a quadratic E-D mode. Built on the core framework of the modular co-attention network (MCAN), CLVIN achieves complete interaction of multimodal information by applying the E-D mode a second time, which yields a rational distribution of weight information over the question words. In addition, to reduce the extra time and memory cost introduced by the quadratic E-D mode, the paper proposes a compact variant, CLVIN-c, which optimizes the underlying implementation of scaled dot-product attention in the Transformer. Experiments on the VQA-v2.0 and CLEVR datasets show that CLVIN improves performance significantly and that CLVIN-c further improves model size and performance. Code is available at https://github.com/RainyMoo/myvqa.
Highlights:
• Shows that incomplete interactions limit the rationality of token distribution.
• Designs CLVIN, a quadratic E-D mode model, to realize reasonable token distribution.
• Proposes CLVIN-c to further improve model size and performance.
• Achieves significant or comparable performance gains over some existing state-of-the-art (SOTA) methods.
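Note: the scaled dot-product attention named in the summary is, in its standard Transformer form, softmax(QK^T / sqrt(d_k))V. The sketch below illustrates only that textbook operation in plain Python. It is not the CLVIN-c optimization, whose details this record does not give, and the function names are hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V.

    queries: list of n_q vectors of length d_k
    keys:    list of n_k vectors of length d_k
    values:  list of n_k vectors of length d_v
    Returns a list of n_q vectors of length d_v.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(values[0])
    output = []
    for q in queries:
        # Scaled similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Attention weights over the keys sum to 1.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return output
```

In a VQA setting the queries might come from image-region features and the keys and values from question-word features, or the reverse; which pairing CLVIN uses at each stage is described in the paper itself, not in this record.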
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110706