
Spatial constraint for efficient semi-supervised video object segmentation

Bibliographic Details
Published in: Computer vision and image understanding, 2023-12, Vol. 237, p. 103843, Article 103843
Main Authors: Chen, Yadang; Ji, Chuanjun; Yang, Zhi-Xin; Wu, Enhua
Format: Article
Language:English
Description
Summary: Semi-supervised video object segmentation is the task of tracking and segmenting objects in a video sequence given annotated masks for one or more frames. Memory-based methods have recently attracted significant attention due to their strong performance. However, storing too much redundant information in memory makes such methods inefficient and inaccurate. Moreover, memory reading usually relies on a global matching strategy, so these methods are susceptible to interference from semantically similar objects and prone to incorrect segmentation. We propose a spatial constraint network to overcome these problems. In particular, we introduce a time-varying sensor and a dynamic feature memory that adaptively store pixel information to facilitate modeling of the target object, greatly reducing information redundancy in the memory without discarding critical information. Furthermore, we propose an efficient memory reader that is less computationally intensive and has a smaller memory footprint. Most importantly, we introduce a spatial constraint module that learns spatial consistency to obtain more precise segmentation; the learned spatial response distinguishes the target from distractors. Experimental results indicate that our method is competitive with state-of-the-art methods on several benchmark datasets. It also achieves an inference speed of approximately 30 FPS, close to the requirement for real-time systems.

Highlights:
• Time-varying sensor and dynamic feature memory reduce redundancy but retain key data.
• Efficient memory reader has a smaller footprint and reduces computational overhead.
• Spatial constraint module maintains a response map to filter visually similar objects.
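The memory reading the summary describes is attention-style matching: each query-frame pixel retrieves a weighted blend of stored memory values according to key similarity. The sketch below is a hypothetical toy illustration (not the paper's actual architecture; the function names, 2-D keys, and window-based constraint are all assumptions) of why a purely global match can be fooled by a semantically similar distractor, while restricting the read to a spatial window around the query position filters it out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def memory_read(query_key, mem_keys, mem_vals,
                positions=None, q_pos=None, radius=None):
    """Attention-style memory read for one query pixel.

    With radius=None this is global matching: every memory entry
    competes, so a far-away distractor with a similar key leaks in.
    With a radius (and per-entry positions), only memory entries
    within a spatial window around q_pos participate -- a toy stand-in
    for a spatial constraint on memory reading.
    """
    idx = list(range(len(mem_keys)))
    if radius is not None:
        idx = [i for i in idx
               if abs(positions[i][0] - q_pos[0]) <= radius
               and abs(positions[i][1] - q_pos[1]) <= radius]
    # Dot-product similarity between the query key and each candidate key.
    sims = [sum(a * b for a, b in zip(query_key, mem_keys[i])) for i in idx]
    weights = softmax(sims)
    # Weighted sum of the candidate memory values.
    dim = len(mem_vals[0])
    out = [0.0] * dim
    for w, i in zip(weights, idx):
        for d in range(dim):
            out[d] += w * mem_vals[i][d]
    return out

# Two memory entries with identical keys: the true target at (0, 0)
# (value 1.0) and a look-alike distractor at (10, 10) (value 0.0).
keys = [[1.0, 0.0], [1.0, 0.0]]
vals = [[1.0], [0.0]]
pos = [(0, 0), (10, 10)]

# Global matching cannot tell them apart: the read blends both values.
print(memory_read([1.0, 0.0], keys, vals))                  # [0.5]
# A spatially constrained read keeps only the nearby entry.
print(memory_read([1.0, 0.0], keys, vals, pos, (0, 0), 2))  # [1.0]
```

In the toy example the distractor's key is identical to the target's, so no amount of appearance matching can separate them; only the spatial window does, which is the intuition behind filtering visually similar objects with a spatial response.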
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2023.103843