Loading…

MixFormer: End-to-End Tracking With Iterative Mixed Attention

Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking frame...

Full description

Saved in:
Bibliographic Details
Published in:IEEE transactions on pattern analysis and machine intelligence 2024-06, Vol.46 (6), p.4129-4146
Main Authors: Cui, Yutao, Jiang, Cheng, Wu, Gangshan, Wang, Limin
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking framework, termed as MixFormer , built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between target and search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers, a hierarchical tracker MixCvT , and a non-hierarchical simple tracker MixViT . For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors between supervised pre-training and self-supervised pre-training in our MixFormer trackers. We also extend the masked autoencoder pre-training to our MixFormer trackers and design the new competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set a new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10 k, OTB100, TOTB and UAV123. In particular, our MixViT-L achieves AUC scores of 73.3% on LaSOT, 86.1% on TrackingNet and 82.8% on TOTB.
ISSN:0162-8828
1939-3539
2160-9292
DOI:10.1109/TPAMI.2024.3349519