Cross-Modal Recurrent Semantic Comprehension for Referring Image Segmentation
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-07, Vol. 33 (7), pp. 3229-3242
Format: Article
Language: English
Summary: Referring image segmentation aims to segment the target object from an image according to a natural language expression. Because language expressions are diverse, word sequences in different orders often convey different semantic information. Previous methods focus on matching individual words to separate visual regions in the image, ignoring the global semantic understanding of the expression that arises from its sequence structure. To address this problem, we redesign a recurrent network structure for referring image segmentation, called the Cross-Modal Recurrent Semantic Comprehension Network (CRSCNet), which obtains a more comprehensive global semantic understanding through iterative cross-modal semantic reasoning. Specifically, in each iteration we first propose a Dynamic SepConv to extract relevant visual features under language guidance and a Language Attentional Feature Modulation to improve feature discriminability, then propose a Cross-Modal Semantic Reasoning module that performs global semantic reasoning by capturing both linguistic and visual information, and finally update and correct the visual features of the predicted object based on that semantic information. Moreover, we propose a Cross-Modal ASPP to capture, from larger receptive fields, richer visual information referred to by the global semantics of the language expression. Extensive experiments demonstrate that the proposed network significantly outperforms previous state-of-the-art methods on multiple datasets.
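The abstract names two language-conditioned operations, Dynamic SepConv and Language Attentional Feature Modulation, without giving their equations. The following is a minimal NumPy sketch of the general idea only: FiLM-style per-channel modulation from a sentence embedding, and a depthwise kernel generated from that embedding. The function names, weight shapes, and the tanh squashing are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def language_modulate(visual, lang_emb, w_gamma, w_beta):
    # Predict a per-channel scale (gamma) and shift (beta) from the
    # sentence embedding, then apply them to the visual feature map,
    # broadcasting over the spatial dimensions (FiLM-style; assumed form).
    gamma = np.tanh(lang_emb @ w_gamma)   # shape (C,)
    beta = np.tanh(lang_emb @ w_beta)     # shape (C,)
    return (1.0 + gamma)[:, None, None] * visual + beta[:, None, None]

def dynamic_sepconv(visual, lang_emb, w_kernel, k=3):
    # Generate one k x k depthwise kernel from the language embedding and
    # slide it over every channel ("same" padding, stride 1). A real
    # implementation would predict one kernel per channel; this sketch
    # shares a single kernel for brevity.
    kernel = (lang_emb @ w_kernel).reshape(k, k)
    c, h, w = visual.shape
    pad = k // 2
    padded = np.pad(visual, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(visual)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel,
                                  axis=(1, 2))
    return out

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16                      # channels, height, width, embedding dim
feat = rng.standard_normal((C, H, W))         # stand-in visual features
emb = rng.standard_normal(D)                  # stand-in sentence embedding
mod = language_modulate(feat, emb,
                        rng.standard_normal((D, C)),
                        rng.standard_normal((D, C)))
out = dynamic_sepconv(mod, emb, rng.standard_normal((D, 9)))
```

In the paper's recurrent design these two steps would run inside each iteration, with the reasoning module updating the features between passes; the sketch shows a single forward pass.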
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3231964