
SiamTFA: Siamese Triple-Stream Feature Aggregation Network for Efficient RGBT Tracking

Bibliographic Details
Published in: IEEE Transactions on Intelligent Transportation Systems, 2024-12, p. 1-14
Main Authors: Zhang, Jianming, Qin, Yu, Fan, Shimeng, Xiao, Zhu, Zhang, Jin
Format: Article
Language:English
container_end_page 14
container_start_page 1
container_title IEEE transactions on intelligent transportation systems
creator Zhang, Jianming
Qin, Yu
Fan, Shimeng
Xiao, Zhu
Zhang, Jin
description RGBT tracking is a task that utilizes images from visible (RGB) and thermal infrared (TIR) modalities to continuously locate a target, which plays an important role in various fields including intelligent transportation systems. Most existing RGBT trackers do not achieve high precision and real-time tracking speed simultaneously. To address this challenge, we propose an innovative RGBT tracker, the Siamese Triple-stream Feature Aggregation Network (SiamTFA). Firstly, a triple-stream backbone is presented to implement multi-modal feature extraction and fusion, which contains two parallel Swin Transformer feature extraction streams, and one feature fusion stream composed of joint-complementary feature aggregation (JCFA) modules. Secondly, our proposed JCFA module utilizes a joint-complementary attention to guide the aggregation of multi-modal features. Specifically, the joint attention can focus on spatial location information and semantic information of the target by combining the features of two modalities. Considering the complementarity between RGB and TIR modalities, the complementary attention is introduced to enhance the information of beneficial modality and suppress the information of ineffective modality. Thirdly, in order to reduce the computational complexity of the joint-complementary attention, we propose a depthwise shared attention structure, which utilizes depthwise convolution and shared features to achieve lightweight attention. Finally, we conduct extensive experiments on four official RGBT test datasets and the experimental results demonstrate that our proposed tracker outperforms some state-of-the-art trackers and the tracking speed reaches 37 frames per second (FPS). The code is available at https://github.com/zjjqinyu/SiamTFA.
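Two ideas from this abstract lend themselves to a short illustration: the complementary attention that enhances the beneficial modality while suppressing the ineffective one, and the use of depthwise convolution to make the attention lightweight. The sketch below is a hypothetical plain-Python toy (scalar quality scores standing in for the learned attention, feature vectors for feature maps), not the authors' SiamTFA implementation; their actual code is at the GitHub link above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def complementary_fuse(rgb_feat, tir_feat, rgb_score, tir_score):
    """Fuse two modality feature vectors with softmax weights derived
    from per-modality quality scores: the more reliable modality is
    enhanced, the less reliable one suppressed."""
    w_rgb, w_tir = softmax([rgb_score, tir_score])
    return [w_rgb * r + w_tir * t for r, t in zip(rgb_feat, tir_feat)]

def conv_weights(k, channels, depthwise):
    """Weight count of a k x k convolution over `channels` channels:
    a depthwise layer keeps one filter per channel (k*k*C), while a
    standard layer with C output channels needs k*k*C*C."""
    return k * k * channels if depthwise else k * k * channels ** 2

# Night-time scene: thermal features are more trustworthy than RGB,
# so the TIR quality score is higher and the fusion leans toward TIR.
fused = complementary_fuse([0.2, 0.8], [0.9, 0.1],
                           rgb_score=0.5, tir_score=2.0)

# Depthwise convolution shrinks the weight count by a factor of C.
ratio = (conv_weights(3, 64, depthwise=False)
         // conv_weights(3, 64, depthwise=True))
```

With these toy numbers the fused vector is pulled toward the TIR feature, and the depthwise variant uses 64x fewer weights than a standard convolution over 64 channels, mirroring the "lightweight attention" motivation in the abstract.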
doi_str_mv 10.1109/TITS.2024.3512551
format article
publisher IEEE
publication_date 2024-12-18
coden ITISFG
fulltext fulltext
identifier ISSN: 1524-9050
ispartof IEEE transactions on intelligent transportation systems, 2024-12, p.1-14
issn 1524-9050
eissn 1558-0016
language eng
recordid cdi_ieee_primary_10804856
source IEEE Xplore (Online service)
subjects Computational modeling
Correlation
Feature extraction
Fuses
lightweight attention
multi-modal feature fusion
Object tracking
Real-time systems
RGBT tracking
Semantics
siamese network
Streams
Target tracking
Transformers
triple-stream network
title SiamTFA: Siamese Triple-Stream Feature Aggregation Network for Efficient RGBT Tracking