LGR-NET: Language Guided Reasoning Network for Referring Expression Comprehension

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-08, Vol. 34 (8), pp. 7771-7784
Main Authors: Lu, Mingcong; Li, Ruifan; Feng, Fangxiang; Ma, Zhanyu; Wang, Xiaojie
Format: Article
Language: English
Summary: Referring Expression Comprehension (REC) is a fundamental task in the vision-and-language domain, which aims to locate an image region according to a natural language expression. REC requires models to capture key clues in the text and perform accurate cross-modal reasoning. A recent trend employs transformer-based methods to address this problem. However, most of these methods treat image and text equally: they perform cross-modal reasoning in a crude way and use textual features as a whole, without detailed considerations (e.g., spatial information). This insufficient utilization of textual features leads to sub-optimal results. In this paper, we propose a Language Guided Reasoning Network (LGR-NET) to fully utilize the guidance of the referring expression. To localize the referred object, we set a prediction token to capture cross-modal features. Furthermore, to sufficiently utilize the textual features, we extend them with our Textual Feature Extender (TFE) in three ways. First, we design a novel coordinate embedding based on textual features; the coordinate embedding is incorporated into the prediction token to promote its capture of language-related visual features. Second, we employ the extracted textual features alternately for Text-guided Cross-modal Alignment (TCA) and Fusion (TCF). Third, we devise a novel cross-modal loss to enhance cross-modal alignment between the referring expression and the learnable prediction token. We conduct extensive experiments on five benchmark datasets, and the results show that our LGR-NET achieves a new state of the art. Source code is available at https://github.com/lmc8133/LGR-NET.
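
The abstract only outlines the architecture. Below is a minimal, hypothetical PyTorch sketch (not taken from the authors' repository) of the core idea as described: a learnable prediction token is augmented with a coordinate embedding derived from the textual features, and then alternates between text-guided alignment (attending to the expression) and text-guided fusion (attending to visual features). All module names, dimensions, and pooling/MLP choices here (LanguageGuidedBlock, coord_head, dim=256, etc.) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn


class LanguageGuidedBlock(nn.Module):
    """One illustrative language-guided reasoning step (hypothetical sketch)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Learnable prediction token that will collect cross-modal evidence.
        self.pred_token = nn.Parameter(torch.randn(1, 1, dim))
        # Assumed design: pooled text features -> rough 4-d box prior -> embedding.
        self.coord_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.coord_embed = nn.Linear(4, dim)
        # Text-guided alignment and fusion, modeled here as two cross-attentions.
        self.align_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L, D) token-level textual features
        # img_feats:  (B, N, D) visual patch features
        b = text_feats.size(0)
        # Coordinate embedding from pooled text, added to the prediction token.
        coords = self.coord_head(text_feats.mean(dim=1)).sigmoid()          # (B, 4)
        token = self.pred_token.expand(b, -1, -1) + self.coord_embed(coords).unsqueeze(1)
        # Alternate: align the token with the expression, then fuse with visual features.
        token = self.norm1(token + self.align_attn(token, text_feats, text_feats)[0])
        token = self.norm2(token + self.fuse_attn(token, img_feats, img_feats)[0])
        return token  # (B, 1, D) cross-modal feature used to regress the referred box


# Toy usage with random features.
block = LanguageGuidedBlock()
out = block(torch.randn(2, 12, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 1, 256])

In the paper itself, TCA, TCF, and the additional cross-modal loss operate over full transformer layers; this sketch compresses them into a single block only to show how the described components could connect.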
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3374786