
Cross-Modal Recurrent Semantic Comprehension for Referring Image Segmentation

Bibliographic Details
Published in:IEEE Transactions on Circuits and Systems for Video Technology 2023-07, Vol.33 (7), p.3229-3242
Main Authors: Shang, Chao, Li, Hongliang, Qiu, Heqian, Wu, Qingbo, Meng, Fanman, Zhao, Taijin, Ngan, King Ngi
Format: Article
Language:English
Summary:Referring image segmentation aims to segment the target object from an image according to the description given by a language expression. Due to the diversity of language expressions, word sequences in different orders often convey different semantic information. Previous methods focus more on separately matching different words to different visual regions of the image, ignoring the global semantic understanding of the language expression based on its sequence structure. To address this problem, we redesign a recurrent network structure for referring image segmentation, called the Cross-Modal Recurrent Semantic Comprehension Network (CRSCNet), to obtain a more comprehensive global semantic understanding through iterative cross-modal semantic reasoning. Specifically, in each iteration, we first propose a Dynamic SepConv to extract relevant visual features guided by language, further propose Language Attentional Feature Modulation to improve feature discriminability, then propose a Cross-Modal Semantic Reasoning module to perform global semantic reasoning by capturing both linguistic and visual information, and finally update and correct the visual features of the predicted object based on the semantic information. Moreover, we propose a Cross-Modal ASPP to capture, from larger receptive fields, richer visual information referred to by the global semantics of the language expression. Extensive experiments demonstrate that our proposed network significantly outperforms previous state-of-the-art methods on multiple datasets.
ISSN:1051-8215
1558-2205
DOI:10.1109/TCSVT.2022.3231964
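
To make the iterative cross-modal reasoning loop described in the summary concrete, below is a minimal, hypothetical PyTorch sketch. The record does not include the paper's implementation, so everything here is an assumption made for illustration: the class name CRSCNetSketch, the layer choices (a language-gated depthwise-separable convolution standing in for Dynamic SepConv, a FiLM-style scale/shift for Language Attentional Feature Modulation, attention plus a GRU cell for Cross-Modal Semantic Reasoning, and parallel dilated convolutions for Cross-Modal ASPP), and all dimensions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the iterative loop described in the abstract; the layer
# choices are simple stand-ins, not the paper's actual modules. For simplicity,
# the visual and language feature dimensions are assumed equal.
class CRSCNetSketch(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=256, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        # Stand-in for "Dynamic SepConv": a language-gated depthwise-separable conv.
        self.gate_fc = nn.Linear(lang_dim, vis_dim)
        self.depthwise = nn.Conv2d(vis_dim, vis_dim, 3, padding=1, groups=vis_dim)
        self.pointwise = nn.Conv2d(vis_dim, vis_dim, 1)
        # Stand-in for "Language Attentional Feature Modulation": FiLM-style scale/shift.
        self.gamma_fc = nn.Linear(lang_dim, vis_dim)
        self.beta_fc = nn.Linear(lang_dim, vis_dim)
        # Stand-in for "Cross-Modal Semantic Reasoning": attention over word features
        # plus a pooled visual token, followed by a recurrent state update.
        self.reason = nn.MultiheadAttention(lang_dim, num_heads=4, batch_first=True)
        self.state_update = nn.GRUCell(lang_dim, lang_dim)
        # Stand-in for "Cross-Modal ASPP": parallel dilated convolutions fused after the loop.
        self.aspp = nn.ModuleList(
            [nn.Conv2d(vis_dim, vis_dim, 3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.mask_head = nn.Conv2d(vis_dim, 1, 1)

    def forward(self, vis_feat, word_feat):
        # vis_feat: (B, C, H, W) image features; word_feat: (B, T, D) word features.
        state = word_feat.mean(dim=1)  # initial global semantic state, (B, D)
        for _ in range(self.num_iters):
            # 1) Extract language-relevant visual features (language-gated sep. conv).
            gate = torch.sigmoid(self.gate_fc(state))[:, :, None, None]
            v = self.pointwise(self.depthwise(vis_feat * gate))
            # 2) Modulate features with language to improve discriminability.
            g = self.gamma_fc(state)[:, :, None, None]
            b = self.beta_fc(state)[:, :, None, None]
            v = F.relu(v * g + b)
            # 3) Global semantic reasoning over word features and pooled visual context.
            vis_token = v.flatten(2).mean(dim=-1, keepdim=True).transpose(1, 2)  # (B, 1, C)
            memory = torch.cat([word_feat, vis_token], dim=1)
            ctx, _ = self.reason(state.unsqueeze(1), memory, memory)
            state = self.state_update(ctx.squeeze(1), state)
            # 4) Update/correct the visual features using the refined semantics.
            vis_feat = vis_feat + v
        # Capture multi-scale context guided by the final semantics, then predict the mask.
        fused = sum(conv(vis_feat) for conv in self.aspp)
        fused = fused * torch.sigmoid(self.gate_fc(state))[:, :, None, None]
        return self.mask_head(fused)  # (B, 1, H, W) mask logits

As a usage example, CRSCNetSketch()(torch.randn(2, 256, 32, 32), torch.randn(2, 12, 256)) returns mask logits of shape (2, 1, 32, 32); the key point of the sketch is only the control flow, namely that a global semantic state is repeatedly refined from both modalities and used to re-extract and correct the visual features before the final multi-scale fusion.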