One-Stream Stepwise Decreasing for Vision-Language Tracking

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-10, Vol. 34 (10), p. 9053-9063
Main Authors: Zhang, Guangtong; Zhong, Bineng; Liang, Qihua; Mo, Zhiyi; Li, Ning; Song, Shuxiang
Format: Article
Language: English
Description
Summary: Relying on a fixed language description given for the initial frame, a vision-language tracker typically adopts a two-stream model structure that aligns vision and language features only at the feature fusion stage. However, this paradigm may degrade tracking performance when the language description is inaccurate, and it offers little further interaction between the modalities. To address these issues, we propose a one-stream vision-language model called One-stream Stepwise Decreasing for Vision-Language Tracking (OSDT). Specifically, we first encode the language description using a language encoder. The resulting language features are then combined with the visual images and fed jointly into a visual encoder, whose self-attention mechanism facilitates more interaction between language and visual features. Moreover, to mitigate the problems caused by inaccurate language descriptions, we design a stepwise decreasing multi-modal interaction framework, in which a Feature Filter Module (FFM) selects the language features that are most relevant to the visual information, so that they provide semantic guidance for visual feature extraction. Furthermore, because it needs no additional feature fusion modules, our one-stream framework can apply the proposed FFM efficiently for feature selection. Consequently, our tracker achieves fast tracking speed compared with existing state-of-the-art vision-language tracking methods. We extensively evaluate our tracker on three benchmarks, namely TNL2K, LaSOT, and OTB99, and it demonstrates competitive performance against state-of-the-art vision-language tracking methods.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3395352
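
Illustrative sketch: the summary describes language features being filtered step by step so that only those most relevant to the visual information remain. The record does not say how the FFM scores relevance or how many features each step keeps, so the sketch below is only one possible reading under stated assumptions. The cosine-similarity scoring against the mean visual token, the keep schedule, and the function names (cosine, filter_language_tokens, stepwise_decreasing) are all hypothetical and are not taken from the article. The sketch also omits the encoder layers that would update the tokens between steps.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_language_tokens(lang_tokens, vis_tokens, keep):
    """Keep the `keep` language tokens most similar to the mean visual token.

    ASSUMPTION: relevance is scored by cosine similarity to the mean visual
    token. The article's actual FFM scoring rule is not given in this record.
    """
    dim = len(vis_tokens[0])
    mean_vis = [sum(t[d] for t in vis_tokens) / len(vis_tokens) for d in range(dim)]
    scores = [cosine(t, mean_vis) for t in lang_tokens]
    ranked = sorted(range(len(lang_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore the original token order
    return [lang_tokens[i] for i in kept]


def stepwise_decreasing(lang_tokens, vis_tokens, schedule):
    """Apply the filter once per step, with a shrinking keep count.

    ASSUMPTION: `schedule` (e.g. [3, 2, 1]) is a hypothetical keep count per
    step. In a real model, encoder layers would update both token sets
    between steps; that is omitted here.
    """
    counts = []
    for keep in schedule:
        keep = min(keep, len(lang_tokens))
        lang_tokens = filter_language_tokens(lang_tokens, vis_tokens, keep)
        counts.append(len(lang_tokens))
    return lang_tokens, counts


if __name__ == "__main__":
    # Toy 2-D "features": four language tokens, two visual tokens.
    lang = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    vis = [[1.0, 0.0], [1.0, 0.2]]
    remaining, counts = stepwise_decreasing(lang, vis, [3, 2, 1])
    print(counts, remaining)
```

In this toy run the language token pointing away from the visual features is dropped first, and the set shrinks at every step. A learned scoring function could replace the cosine rule without changing the overall structure.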