Visual tracking using transformer with a combination of convolution and attention

Bibliographic Details
Published in:	Image and vision computing 2023-09, Vol.137, p.104760, Article 104760
Main Authors:	Wang, Yuxuan; Yan, Liping; Feng, Zihang; Xia, Yuanqing; Xiao, Bo
Format:	Article
Language:	English
Subjects:	Attention; Siamese networks; Transformer; Visual tracking

Description
Summary:	For Siamese-based trackers in single object tracking, the cross-correlation operation plays an important role. However, cross-correlation essentially uses the target feature to match the search region locally and linearly, which leads to insufficient utilization or even loss of feature information. To employ global context effectively and to fully explore the relevance between the template and the search region, a novel matching operator inspired by the Transformer is designed; it uses multi-head attention and embeds a purpose-built modulation module across the operator's inputs. Meanwhile, we equip our tracker with a multi-scale encoder/decoder strategy that gradually makes tracking more precise. Finally, a complete tracking framework named VTTR is presented. The tracker consists of a feature extractor, a multi-scale encoder based on depth-wise convolution, a modified decoder serving as the matching operator, and a prediction head. The proposed tracker is tested on many benchmarks and achieves excellent performance while running at high speed.
• An attention-based tracking framework using a multi-scale strategy is presented.
• The parallel multi-scale template encoder can better generalize template features.
• The decoder that uses modulation makes the attention operation suitable for tracking.
• A coarse-to-fine decoding strategy can effectively mine feature information.
ISSN:	0262-8856, 1872-8138
DOI:	10.1016/j.imavis.2023.104760
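
The summary contrasts local, linear cross-correlation with attention-based matching that draws on global context. As a rough illustration of the latter idea only, the following minimal sketch computes single-head scaled dot-product cross-attention in plain Python. It is not the VTTR operator: it has no learned projections, no multiple heads, and no modulation module. The function names, the toy feature vectors, and the choice of search-region features as queries with template features as keys and values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(search, template):
    """Hypothetical helper: each search-region feature (query) attends to
    every template feature (key and value), so each output vector is a
    weighted mix of the whole template rather than a local linear match."""
    dim = len(template[0])
    scale = math.sqrt(dim)
    fused = []
    for query in search:
        weights = softmax([dot(query, key) / scale for key in template])
        fused.append([
            sum(w * value[d] for w, value in zip(weights, template))
            for d in range(dim)
        ])
    return fused


# Toy data (assumed): two template features, three search-region features.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
matched = cross_attention(search, template)
```

In this toy setup the first search feature is pulled mostly toward the first template feature, the second toward the second, and the third, which is equally similar to both, receives an even blend of the two.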