Loading…
Transformer Sub-Patch Matching for High-Performance Visual Object Tracking
Visual tracking is a core component of intelligent transportation systems, especially for unmanned driving and road surveillance. Numerous convolutional neural network (CNN) trackers have achieved unprecedented performance. However, CNN features with regular spatial context relationships experience...
Saved in:
Published in: | IEEE transactions on intelligent transportation systems 2023-08, Vol.24 (8), p.1-15 |
---|---|
Main Authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Visual tracking is a core component of intelligent transportation systems, especially for unmanned driving and road surveillance. Numerous convolutional neural network (CNN) trackers have achieved unprecedented performance. However, CNN features with regular spatial context relationships experience difficulty matching the rigid target templates when dramatic deformation and occlusion occur. In this paper, we propose a novel full Transformer Sub-patch Matching network for tracking (TSMtrack), which decomposes the tracked object into sub-patches, and interlaced matches the extracted sub-patches by leveraging the attention mechanism born with the Transformer. Roots in Transformer architecture, TSMtrack consists of image patch decomposition, sub-patch matching, and position prediction. Specifically, TSMtrack converts the whole frame into sub-patches and extracts the sub-patch features independently. By sub-patch matching and FFN-like prediction, TSMtrack enables independent similarity measurement between sub-patch features in an interlaced and iterative fashion. With a full Transformer pipeline implemented, we achieve a high-quality trade-off between tracking speed performance. Experiments on nine benchmarks demonstrate the effectiveness of our Transformer sub-patch matching framework. In particular, it realizes an AO of 75.6 on GOT-10K and SR of 57.9 on WebUAV-3M with 48 FPS on GPU RTX-2060s. |
---|---|
ISSN: | 1524-9050 1558-0016 |
DOI: | 10.1109/TITS.2023.3264664 |