
Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-02, Vol. 35 (2), pp. 1523-1533
Main Authors: Zhao, Heng, Zhou, Joey Tianyi, Ong, Yew-Soon
Format: Article
Language:English
Description
Summary: Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features for target localization. Such a formulation provides insufficient ability to model the query at the word level and is therefore prone to neglecting words that may not be the most important ones in the sentence but are critical for identifying the referred object. In this article, we propose Word2Pix: a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word in the query sentence is given an equal opportunity to attend to visual pixels through multiple stacks of transformer decoder layers. In this way, the decoder learns to model the language query and fuse language with visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while retaining the merits of the one-stage paradigm, namely end-to-end training and fast inference speed. Code is available at https://github.com/azurerain7/Word2Pix.
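
To make the attention scheme concrete, below is a minimal PyTorch sketch of word-to-pixel cross-attention in a transformer decoder as the abstract describes it: per-word embeddings serve as decoder queries, and flattened feature-map pixels serve as the memory (keys/values), so every word attends to every pixel. All dimensions, layer counts, and the box head here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn

d_model, n_heads, n_layers = 256, 8, 6  # assumed sizes, for illustration only

# Stacked decoder layers: self-attention over words, then cross-attention
# from words to pixels, fusing language and vision inside the decoder.
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

B, n_words = 2, 12   # batch size and query-sentence length (assumed)
H, W = 20, 20        # spatial size of the visual feature map (assumed)

words = torch.randn(B, n_words, d_model)   # per-word embeddings (decoder queries)
pixels = torch.randn(B, H * W, d_model)    # flattened visual features (keys/values)

# Each word attends to all H*W pixels in every decoder layer.
fused = decoder(tgt=words, memory=pixels)  # (B, n_words, d_model)

# A hypothetical box head pooling the fused word features into a
# normalized (cx, cy, w, h) prediction for the referred object.
box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4))
pred_box = box_head(fused.mean(dim=1)).sigmoid()  # (B, 4)
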
ISSN: 2162-237X
2162-2388
DOI: 10.1109/TNNLS.2022.3183827