Transformer Sub-Patch Matching for High-Performance Visual Object Tracking

Bibliographic Details
Published in: IEEE transactions on intelligent transportation systems 2023-08, Vol.24 (8), p.1-15
Main Authors: Tang, Chuanming; Hu, Qintao; Zhou, Gaofan; Yao, Jinzhen; Zhang, Jianlin; Huang, Yongmei; Ye, Qixiang
Format: Article
Language: English
Description
Summary: Visual tracking is a core component of intelligent transportation systems, especially for unmanned driving and road surveillance. Numerous convolutional neural network (CNN) trackers have achieved unprecedented performance. However, CNN features with regular spatial context relationships have difficulty matching rigid target templates when dramatic deformation and occlusion occur. In this paper, we propose a novel full Transformer Sub-patch Matching network for tracking (TSMtrack), which decomposes the tracked object into sub-patches and matches the extracted sub-patches in an interlaced manner by leveraging the attention mechanism inherent to the Transformer. Rooted in the Transformer architecture, TSMtrack consists of image patch decomposition, sub-patch matching, and position prediction. Specifically, TSMtrack converts the whole frame into sub-patches and extracts the sub-patch features independently. Through sub-patch matching and FFN-like prediction, TSMtrack enables independent similarity measurement between sub-patch features in an interlaced and iterative fashion. With a full Transformer pipeline implemented, we achieve a high-quality trade-off between tracking speed and performance. Experiments on nine benchmarks demonstrate the effectiveness of our Transformer sub-patch matching framework. In particular, it achieves an AO of 75.6 on GOT-10K and an SR of 57.9 on WebUAV-3M at 48 FPS on an RTX-2060s GPU.
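The two core steps named in the summary, patch decomposition and attention-based sub-patch matching, can be illustrated with a minimal sketch. This is not the TSMtrack implementation: it assumes raw pixel values stand in for learned sub-patch features, uses a single scaled dot-product attention step with no learned projections, and the function names `split_into_patches`, `softmax`, and `match_patches` are hypothetical, chosen only for this example.

```python
import math


def split_into_patches(image, patch):
    """Split a 2-D list (H x W) into non-overlapping patch x patch blocks.

    Each block is flattened into a vector; blocks are returned in
    row-major order. Assumption: H and W are multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_patches(search_patches, template_patches):
    """Attention weights of every search sub-patch over the template sub-patches.

    Row i holds softmax(q_i . k_j / sqrt(d)) over all template patches j,
    i.e. plain scaled dot-product attention without learned projections.
    """
    scale = math.sqrt(len(template_patches[0]))
    weights = []
    for query in search_patches:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in template_patches]
        weights.append(softmax(scores))
    return weights


# Toy usage: a 4x4 "search frame" matched against two template sub-patches.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
search = split_into_patches(frame, 2)
attention = match_patches(search, search[:2])
```

Each row of `attention` sums to 1 and indicates how strongly one search sub-patch responds to each template sub-patch. A real tracker would feed such weights into further layers and a prediction head, which this sketch omits.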
ISSN: 1524-9050, 1558-0016
DOI: 10.1109/TITS.2023.3264664