
One-Stream Stepwise Decreasing for Vision-Language Tracking


Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-10, Vol. 34 (10), p. 9053-9063
Main Authors: Zhang, Guangtong; Zhong, Bineng; Liang, Qihua; Mo, Zhiyi; Li, Ning; Song, Shuxiang
Format: Article
Language: English
Publisher: IEEE, New York
Abstract: Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature fusion stages. However, this paradigm may degrade tracking performance due to inaccurate language descriptions and lacks deeper cross-modal interaction. To address these issues, we propose a one-stream vision-language model called One-stream Stepwise Decreasing for Vision-Language Tracking (OSDT). Specifically, we first encode the language description using a language encoder. The obtained language features are then combined with visual images and entered jointly into a visual encoder, in which the encoder's self-attention mechanism is utilized to facilitate more interactions between language and visual features. Moreover, to mitigate the problems caused by inaccurate language descriptions, we design a stepwise decreasing multi-modal interaction framework, in which a Feature Filter Module (FFM) is introduced to select the language features that are most relevant to the visual information, providing semantic guidance for visual feature extraction. Furthermore, without additional feature fusion modules, our one-stream model framework can efficiently utilize the proposed feature filtering module for feature selection. Consequently, our tracker achieves faster tracking speeds than existing state-of-the-art methods in the vision-language tracking domain. We extensively evaluate our tracker on three benchmarks, i.e., TNL2K, LaSOT, and OTB99, demonstrating competitive performance against state-of-the-art vision-language tracking methods.
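The abstract describes a pipeline of joint self-attention over concatenated visual and language tokens, with a filter that stepwise reduces the retained language tokens by their relevance to the visual features. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch of that general idea only. All names (`self_attention`, `feature_filter`), the cosine-similarity scoring, the identity attention projections, and the token counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention over the joint token sequence (identity
    # Q/K/V projections for brevity): every visual token can attend to
    # every retained language token, and vice versa.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores, axis=-1) @ tokens

def feature_filter(lang_tokens, vis_tokens, keep):
    # Hypothetical stand-in for the FFM: score each language token by
    # cosine similarity to the mean visual feature and keep only the
    # `keep` most relevant ones, preserving their original order.
    ref = vis_tokens.mean(axis=0)
    sims = (lang_tokens @ ref) / (
        np.linalg.norm(lang_tokens, axis=1) * np.linalg.norm(ref) + 1e-8)
    top = np.argsort(sims)[::-1][:keep]
    return lang_tokens[np.sort(top)]

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 32))   # 16 visual patch tokens, dim 32
lang = rng.standard_normal((8, 32))   # 8 language tokens, dim 32

# "Stepwise decreasing": each successive stage retains fewer language
# tokens, so unreliable description words fade out of the interaction.
for keep in (8, 4, 2):
    lang_kept = feature_filter(lang, vis, keep)
    joint = np.concatenate([vis, lang_kept], axis=0)
    vis = self_attention(joint)[:16]  # updated visual tokens
print(vis.shape)  # (16, 32)
```

Note the design point this sketch mirrors: because language tokens sit in the same sequence as visual tokens, no separate fusion module is needed; filtering simply shortens the sequence, which is also why a one-stream design can be fast.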
DOI: 10.1109/TCSVT.2024.3395352
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL) Journals
Subjects: Automobiles; Coders; Descriptions; Feature extraction; Information filters; Language; Modules; Natural languages; Object tracking; Roads; Target tracking; Tracking; Vision; vision-language tracking; Visualization