CLVIN: Complete language-vision interaction network for visual question answering
Published in: | Knowledge-based systems 2023-09, Vol.275, p.110706, Article 110706 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | The emergence of the Transformer has improved the interactive modeling of multimodal information in visual question answering (VQA) tasks, helping machines better understand multimodal information. Existing Transformer-based end-to-end methods have made progress either by applying the Encoder-Decoder (E-D) mode or by realizing complete interaction. However, almost no method combines the advantages of the two and exploits them fully. This paper therefore designs a complete language-vision interaction network (CLVIN) for VQA built on a quadratic E-D mode. Based on the core framework of the modular co-attention network (MCAN), CLVIN achieves complete interaction of multimodal information by applying the E-D mode a second time, which yields a rational distribution of the question words’ weight information. In addition, to reduce the extra time and memory cost introduced by the quadratic E-D mode, this paper proposes a compact variant, CLVIN-c, that optimizes the underlying implementation of scaled dot-product attention in the Transformer. Finally, experimental results on the VQA-v2.0 and CLEVR datasets show that CLVIN achieves a significant performance improvement, and that CLVIN-c achieves further optimizations in model size and performance. Code is available at https://github.com/RainyMoo/myvqa.
• Show that incomplete interactions limit the rationality of token distribution.
• Design CLVIN, a quadratic E-D mode model, to realize reasonable token distribution.
• Propose CLVIN-c to further improve model size and performance.
• Achieve significant or comparable performance gains relative to some existing SOTA methods. |
---|---|
ISSN: | 0950-7051 1872-7409 |
DOI: | 10.1016/j.knosys.2023.110706 |
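
The summary refers to scaled dot-product attention, the Transformer operation that CLVIN-c is said to optimize. For readers unfamiliar with it, the following is a minimal pure-Python sketch of the standard formulation, softmax(QK^T / sqrt(d_k))V. It is not the CLVIN or CLVIN-c implementation, whose details this record does not give. The toy vectors and the names `image_regions` and `question_words` are hypothetical, chosen only to illustrate a cross-modal case in which image-region vectors act as queries over question-word vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, values are lists of equal-length row vectors.
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical toy features (assumption, not data from the paper):
# 2 image regions attend over 3 question words.
image_regions = [[1.0, 0.0], [0.0, 1.0]]
question_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended, weights = scaled_dot_product_attention(
    image_regions, question_words, question_words
)
```

Each row of `weights` sums to 1 and indicates how strongly one image region attends to each question word. This per-word weight distribution is the kind of quantity the summary describes as the question words’ weight information.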