
A Transformer-based network with adaptive spatial prior for visual tracking

Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformers have demonstrated their efficacy in visual object tracking tasks, owing to their capacity to capture long-range dependencies between image pixels. However, two limitatio...

Bibliographic Details
Published in: Neurocomputing (Amsterdam) 2025-01, Vol.614, p.128821, Article 128821
Main Authors: Cheng, Feng ; Peng, Gaoliang ; Li, Junbao ; Zhao, Benqi ; Pan, Jeng-Shyang ; Li, Hang
Format: Article
Language: English
Subjects: Object tracking ; Siamese network ; Spatial prior ; Visual transformer
cited_by
cites cdi_FETCH-LOGICAL-c185t-1bcd7ae61b331342cd4d4c3d5d4a1a17afe5c5adaf0e499ae2a9807352767da83
container_end_page
container_issue
container_start_page 128821
container_title Neurocomputing (Amsterdam)
container_volume 614
creator Cheng, Feng
Peng, Gaoliang
Li, Junbao
Zhao, Benqi
Pan, Jeng-Shyang
Li, Hang
description Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformers have demonstrated their efficacy in visual object tracking tasks, owing to their capacity to capture long-range dependencies between image pixels. However, two limitations hinder further performance improvement of transformer-based trackers. First, a transformer partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Second, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address these issues, we propose a fully transformer-based tracking framework that learns structural prior information, called SPformer. Specifically, a self-attention spatial-prior generative network is established to model the spatial associations between features. Moreover, cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to extract the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-the-art (SOTA) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.
doi_str_mv 10.1016/j.neucom.2024.128821
format article
fullrecord <record><control><sourceid>elsevier_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1016_j_neucom_2024_128821</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0925231224015923</els_id><sourcerecordid>S0925231224015923</sourcerecordid><originalsourceid>FETCH-LOGICAL-c185t-1bcd7ae61b331342cd4d4c3d5d4a1a17afe5c5adaf0e499ae2a9807352767da83</originalsourceid><addsrcrecordid>eNp9kMtOwzAQRb0AiVL4Axb-gQSP7TTJBqmqeFRUYlPW1tSegPtIKttNxd-TKqxZjTTSuffqMPYAIgcBs8dt3tLJdodcCqlzkFUl4YpNRC2LTCqQN-w2xq0QUIKsJ-x9ztcB29h04UAh22Akx1tK5y7s-Nmnb44Oj8n3xOMRk8c9PwbfBT4AvPfxNDxSQLvz7dcdu25wH-n-707Z58vzevGWrT5el4v5KrNQFSmDjXUl0gw2SoHS0jrttFWucBoBocSGClsMtY0gXddIEutKlKqQ5ax0WKkp02OuDV2MgRozTDpg-DEgzEWC2ZpRgrlIMKOEAXsaMRq29Z6CidZTa8n5QDYZ1_n_A34BtPZrGg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A Transformer-based network with adaptive spatial prior for visual tracking</title><source>ScienceDirect Journals</source><creator>Cheng, Feng ; Peng, Gaoliang ; Li, Junbao ; Zhao, Benqi ; Pan, Jeng-Shyang ; Li, Hang</creator><creatorcontrib>Cheng, Feng ; Peng, Gaoliang ; Li, Junbao ; Zhao, Benqi ; Pan, Jeng-Shyang ; Li, Hang</creatorcontrib><description>Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. 
Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.</description><identifier>ISSN: 0925-2312</identifier><identifier>DOI: 10.1016/j.neucom.2024.128821</identifier><language>eng</language><publisher>Elsevier B.V</publisher><subject>Object tracking ; Siamese network ; Spatial prior ; Visual transformer</subject><ispartof>Neurocomputing (Amsterdam), 2025-01, Vol.614, p.128821, Article 128821</ispartof><rights>2024 Elsevier B.V.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c185t-1bcd7ae61b331342cd4d4c3d5d4a1a17afe5c5adaf0e499ae2a9807352767da83</cites><orcidid>0000-0003-3726-1821 ; 0000-0002-8543-9455</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Cheng, Feng</creatorcontrib><creatorcontrib>Peng, Gaoliang</creatorcontrib><creatorcontrib>Li, 
Junbao</creatorcontrib><creatorcontrib>Zhao, Benqi</creatorcontrib><creatorcontrib>Pan, Jeng-Shyang</creatorcontrib><creatorcontrib>Li, Hang</creatorcontrib><title>A Transformer-based network with adaptive spatial prior for visual tracking</title><title>Neurocomputing (Amsterdam)</title><description>Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. 
We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.</description><subject>Object tracking</subject><subject>Siamese network</subject><subject>Spatial prior</subject><subject>Visual transformer</subject><issn>0925-2312</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2025</creationdate><recordtype>article</recordtype><recordid>eNp9kMtOwzAQRb0AiVL4Axb-gQSP7TTJBqmqeFRUYlPW1tSegPtIKttNxd-TKqxZjTTSuffqMPYAIgcBs8dt3tLJdodcCqlzkFUl4YpNRC2LTCqQN-w2xq0QUIKsJ-x9ztcB29h04UAh22Akx1tK5y7s-Nmnb44Oj8n3xOMRk8c9PwbfBT4AvPfxNDxSQLvz7dcdu25wH-n-707Z58vzevGWrT5el4v5KrNQFSmDjXUl0gw2SoHS0jrttFWucBoBocSGClsMtY0gXddIEutKlKqQ5ax0WKkp02OuDV2MgRozTDpg-DEgzEWC2ZpRgrlIMKOEAXsaMRq29Z6CidZTa8n5QDYZ1_n_A34BtPZrGg</recordid><startdate>20250121</startdate><enddate>20250121</enddate><creator>Cheng, Feng</creator><creator>Peng, Gaoliang</creator><creator>Li, Junbao</creator><creator>Zhao, Benqi</creator><creator>Pan, Jeng-Shyang</creator><creator>Li, Hang</creator><general>Elsevier B.V</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0003-3726-1821</orcidid><orcidid>https://orcid.org/0000-0002-8543-9455</orcidid></search><sort><creationdate>20250121</creationdate><title>A Transformer-based network with adaptive spatial prior for visual tracking</title><author>Cheng, Feng ; Peng, Gaoliang ; Li, Junbao ; Zhao, Benqi ; Pan, Jeng-Shyang ; Li, Hang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c185t-1bcd7ae61b331342cd4d4c3d5d4a1a17afe5c5adaf0e499ae2a9807352767da83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2025</creationdate><topic>Object tracking</topic><topic>Siamese network</topic><topic>Spatial prior</topic><topic>Visual transformer</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Cheng, 
Feng</creatorcontrib><creatorcontrib>Peng, Gaoliang</creatorcontrib><creatorcontrib>Li, Junbao</creatorcontrib><creatorcontrib>Zhao, Benqi</creatorcontrib><creatorcontrib>Pan, Jeng-Shyang</creatorcontrib><creatorcontrib>Li, Hang</creatorcontrib><collection>CrossRef</collection><jtitle>Neurocomputing (Amsterdam)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Cheng, Feng</au><au>Peng, Gaoliang</au><au>Li, Junbao</au><au>Zhao, Benqi</au><au>Pan, Jeng-Shyang</au><au>Li, Hang</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Transformer-based network with adaptive spatial prior for visual tracking</atitle><jtitle>Neurocomputing (Amsterdam)</jtitle><date>2025-01-21</date><risdate>2025</risdate><volume>614</volume><spage>128821</spage><pages>128821-</pages><artnum>128821</artnum><issn>0925-2312</issn><abstract>Single object tracking (SOT) in complex scenes presents significant challenges in computer vision. In recent years, transformer has shown its demonstrated efficacy in visual object tracking tasks, due to its capacity to capture the long-range dependencies between image pixels. However, two limitations hinder the performance improvement of transformer-based trackers. Firstly, transformer splits and partitions the image into a sequence of patches, which disrupts the internal structural information of the object. Secondly, transformer-based trackers encode the target template and search region together, potentially leading to confusion between the target and background during feature interaction. To address the above issues, we propose a fully transformer-based tracking framework via learning structural prior information, called SPformer. In other words, a self-attention spatial-prior generative network is established for simulating the spatial associations between features. 
Moreover, the cross-attention structural prior extractors based on Gaussian and arbitrary distributions are developed to seek the semantic interaction features between the object template and the search region, effectively mitigating feature confusion. Extensive experiments on eight prevailing benchmarks demonstrate that SPformer outperforms existing state-of-art (SOAT) trackers. We further analyze the effectiveness of the two proposed prior modules and validate their application in target tracking models.</abstract><pub>Elsevier B.V</pub><doi>10.1016/j.neucom.2024.128821</doi><orcidid>https://orcid.org/0000-0003-3726-1821</orcidid><orcidid>https://orcid.org/0000-0002-8543-9455</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 0925-2312
ispartof Neurocomputing (Amsterdam), 2025-01, Vol.614, p.128821, Article 128821
issn 0925-2312
language eng
recordid cdi_crossref_primary_10_1016_j_neucom_2024_128821
source ScienceDirect Journals
subjects Object tracking
Siamese network
Spatial prior
Visual transformer
title A Transformer-based network with adaptive spatial prior for visual tracking
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T09%3A19%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Transformer-based%20network%20with%20adaptive%20spatial%20prior%20for%20visual%20tracking&rft.jtitle=Neurocomputing%20(Amsterdam)&rft.au=Cheng,%20Feng&rft.date=2025-01-21&rft.volume=614&rft.spage=128821&rft.pages=128821-&rft.artnum=128821&rft.issn=0925-2312&rft_id=info:doi/10.1016/j.neucom.2024.128821&rft_dat=%3Celsevier_cross%3ES0925231224015923%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c185t-1bcd7ae61b331342cd4d4c3d5d4a1a17afe5c5adaf0e499ae2a9807352767da83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true