Mask-Guided Siamese Tracking With a Frequency-Spatial Hybrid Network

Bibliographic Details
Published in:	IEEE Transactions on Circuits and Systems for Video Technology, 2025-01, Vol. 35 (1), pp. 103-117
Main Authors:	Xiong, Jiabing; Ling, Qiang
Format:	Article
Language:	English
Subjects:	Convolutional neural networks; feature aggregation; Feature extraction; Frequency-domain analysis; frequency-spatial hybrid; mask embedding; Modules; Semantics; Siamese network; Spatial resolution; Target masking; Target tracking; Tracking; Visual tracking; Visualization

Description
Summary:	Current tracking methods often adopt a compact template to emphasize target-specific features, alongside an expansive search region to encapsulate surrounding environmental information. However, a small template may lose critical contextual information, which can be particularly harmful in challenging scenarios. Moreover, current tracking methods predominantly focus on spatial or channel operations, neglecting the potential of the frequency domain. To resolve these issues, we propose a novel Mask-Guided Siamese Tracking (MGTrack) framework to enhance tracking efficacy from two perspectives. First, we propose an innovative Template Mask Encoder (TME) that employs a large template to produce a learnable mask embedding, thus preserving more surrounding contextual cues while focusing on target-oriented discriminative features. Second, we propose a frequency-spatial hybrid network, which is composed of a Frequency-Spatial Fusion (FSF) module and a Frequency-Spatial Attention (FSA) module. In particular, the FSF module integrates frequency blocks with local and global fusion blocks, effectively aggregating deep semantic features from the backbone network with shallow texture features. Additionally, the FSA module enables bidirectional information exchange between spatial and frequency attention during the feature interaction process. Experiments across short-term and long-term tracking benchmarks demonstrate that our MGTrack achieves better tracking performance with fewer parameters and FLOPs than some state-of-the-art tracking frameworks. The code of our MGTrack is available at https://github.com/jiabingxiing/MGTrack.
ISSN:	1051-8215 1558-2205
DOI:	10.1109/TCSVT.2024.3452714
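
Illustration:	The Summary describes combining frequency-domain and spatial operations on feature maps. The sketch below is a generic, minimal illustration of that idea and is not the authors' FSF or FSA module; for the actual implementation, see the repository linked in the Summary. It assumes a single-channel feature map stored as a list of lists. The function names (dft2, frequency_branch, spatial_branch, hybrid), the per-bin gate, and the blending weight alpha are all hypothetical and chosen only for this example.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2-D DFT of an H x W list of lists (illustration only, O(H^2 W^2))."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for i in range(h):
                for j in range(w):
                    angle = 2 * cmath.pi * (u * i / h + v * j / w)
                    total += x[i][j] * cmath.exp(sign * 1j * angle)
            # The inverse transform carries the 1 / (H * W) normalization.
            out[u][v] = total / (h * w) if inverse else total
    return out


def frequency_branch(x, gate):
    """Scale each frequency bin by gate[u][v], then return to the spatial domain.

    Assumption: in a trained network the gate would be learned; here it is given.
    """
    spec = dft2(x)
    gated = [[s * g for s, g in zip(srow, grow)] for srow, grow in zip(spec, gate)]
    return [[c.real for c in row] for row in dft2(gated, inverse=True)]


def spatial_branch(x):
    """3 x 3 mean filter with edge replication (stand-in for a local convolution)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += x[ii][jj]
            out[i][j] = acc / 9.0
    return out


def hybrid(x, gate, alpha=0.5):
    """Blend the two branches; alpha is a hypothetical fixed fusion weight."""
    f = frequency_branch(x, gate)
    s = spatial_branch(x)
    return [[alpha * fv + (1 - alpha) * sv for fv, sv in zip(fr, sr)]
            for fr, sr in zip(f, s)]
```

With an all-ones gate the frequency branch reproduces its input up to floating-point error, and a gate that keeps only the zero-frequency bin replaces every position with the global mean. That contrast shows why a frequency branch can capture global context that a small local filter cannot.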