
Adaptively bypassing vision transformer blocks for efficient visual tracking

Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels; rather, this impact varies with the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect tracking accuracy. We propose a Bypass Decision Module (BDM) to determine whether a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up inference. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we introduce a novel ViT pruning method that reduces the dimension of the latent representation of tokens in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: https://github.com/xyyang317/ABTrack.

Highlights:
• Because semantic features affect tracking unevenly across abstraction levels, we propose the Bypass Decision Module.
• To counteract the BDMs' time cost, we propose a new ViT pruning method that reduces the token latent dimension.
• We introduce ABTrack, a tracker that combines favorable effectiveness with real-time capacity.
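The abstract describes two mechanisms: a Bypass Decision Module (BDM) that decides, per input, whether a given transformer block should be skipped, and a ViT pruning step that shrinks the latent dimension of tokens. The released code at the link above is authoritative; purely as a hedged illustration of the bypass idea, the PyTorch sketch below wraps a generic ViT block with a tiny gating head. All names (BypassDecisionModule, AdaptiveBlock, bypass_threshold), the soft training-time gate, and the per-batch hard skip at inference are assumptions for illustration, not taken from the paper or its repository; the token-dimension pruning component is omitted.

```python
# Hypothetical sketch of block bypassing, not the authors' implementation.
import torch
import torch.nn as nn


class BypassDecisionModule(nn.Module):
    """Lightweight head predicting the probability that the next block can be skipped."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); pool to one vector per sample.
        pooled = tokens.mean(dim=1)
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)  # (batch,)


class AdaptiveBlock(nn.Module):
    """Wraps a ViT block so it can be bypassed per input at inference time."""

    def __init__(self, block: nn.Module, dim: int, bypass_threshold: float = 0.5):
        super().__init__()
        self.block = block
        self.bdm = BypassDecisionModule(dim)
        self.bypass_threshold = bypass_threshold

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        p_bypass = self.bdm(tokens)
        if self.training:
            # Keep the decision differentiable during training by softly mixing
            # the identity path and the block output (one possible relaxation).
            gate = p_bypass.view(-1, 1, 1)
            return gate * tokens + (1.0 - gate) * self.block(tokens)
        # At inference, hard-skip the block when the bypass score is high enough.
        if bool((p_bypass > self.bypass_threshold).all()):
            return tokens
        return self.block(tokens)


if __name__ == "__main__":
    # Toy usage: wrap a standard encoder layer and run dummy search-region tokens.
    dim = 256
    vit_block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    layer = AdaptiveBlock(vit_block, dim).eval()
    with torch.no_grad():
        out = layer(torch.randn(1, 64, dim))
    print(out.shape)  # torch.Size([1, 64, 256])
```

In this sketch the hard skip fires only when every sample in the batch agrees; in single-object tracking the batch is typically one search region per frame, so a per-sample decision can translate directly into skipped computation.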


Bibliographic Details
Published in: Pattern recognition, 2025-05, Vol. 161, p. 111278, Article 111278
Main Authors: Yang, Xiangyang; Zeng, Dan; Wang, Xucheng; Wu, You; Ye, Hengzhou; Zhao, Qijun; Li, Shuiwang
Format: Article
Language: English
Subjects: Adaptively bypassing; Efficient visual tracking; Pruning
DOI: 10.1016/j.patcog.2024.111278
ISSN: 0031-3203
Publisher: Elsevier Ltd
Source: ScienceDirect Freedom Collection