Multiscale deep feature selection fusion network for referring image segmentation

Bibliographic Details
Published in:	Multimedia tools and applications 2024-04, Vol.83 (12), p.36287-36305
Main Authors:	Dai, Xianwen, Lin, Jiacheng, Nai, Ke, Li, Qingpeng, Li, Zhiyong
Format:	Article
Language:	English
Subjects:	Accuracy; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Deep learning; Feature selection; Image segmentation; Multimedia; Multimedia Information Systems; Neural networks; Semantics; Special Purpose and Application-Based Systems; Track 6: Computer Vision for Multimedia Applications

Description
Summary:	Referring image segmentation has attracted extensive attention in recent years. Previous methods have explored the difficult alignment between visual and textual features, but the problem has not been effectively addressed. The result is insufficient interaction between visual and textual features, which degrades model performance. To this end, we first propose a language-aware pixel feature fusion module (LPFFM), based on a self-attention mechanism, to ensure that the features of the two modalities interact sufficiently across both the spatial and channel dimensions. We apply it from the shallow to the deep layers of the encoder to gradually select visual features related to the text. Second, we propose a second selection mechanism to further select visual features that contain only the target. For this mechanism, we design an attention contrastive loss to better suppress irrelevant background information. Further, we propose a multi-scale deep feature selection fusion network (MDSFNet) based on the U-Net architecture. Experimental results show that the proposed method is competitive with previous methods, improving performance by 2.87%, 3.17%, and 3.81% on three benchmark datasets, RefCOCO, RefCOCO+, and G-ref, respectively.
ISSN:	1380-7501, 1573-7721
DOI:	10.1007/s11042-023-16913-6