One-Stream Stepwise Decreasing for Vision-Language Tracking
Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature fusion stages. However, this paradigm may degrade the tracking performance due to inaccurate language descriptions and lacks further modal interaction. To address these issues, we propose a one-stream vision-language model called One-stream Stepwise Decreasing for Vision-Language Tracking (OSDT). Specifically, we first encode the language description using a language encoder. The obtained language features are then combined with visual images and entered jointly into a visual encoder, in which the encoder's self-attention mechanism is utilized to facilitate more interactions between language and visual features. Moreover, to mitigate the problems caused by inaccurate language descriptions, we design a stepwise decreasing multi-modal interaction framework, in which a Feature Filter Module (FFM) is introduced to select language features that are more relevant to visual information to provide semantic guidance for visual feature extraction. Furthermore, without additional feature fusion modules, our one-stream model framework can efficiently utilize the proposed feature filtering module for feature selection. Consequently, our tracker can achieve fast tracking speed in the vision-language tracking domain compared to existing state-of-the-art methods. We extensively evaluate our tracker on three benchmarks, i.e. TNL2K, LaSOT, and OTB99, demonstrating competing performance compared to state-of-the-art vision-language tracking methods.
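The joint one-stream encoding and stepwise decreasing interaction described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' released OSDT implementation: the token counts, the similarity-based scoring rule in the filter, and the decreasing keep-schedule are all assumptions made for the example.

```python
# Hedged sketch: language tokens are concatenated with visual tokens and
# processed by shared self-attention (one stream, no separate fusion stage);
# a toy Feature Filter Module keeps progressively fewer language tokens.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head self-attention over the joint sequence, so language and
    # visual tokens interact inside the encoder itself.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

def feature_filter(lang, vis, keep):
    # FFM-style selection (assumed scoring rule): rank each language token
    # by its mean similarity to the visual tokens, keep the top `keep`.
    scores = (lang @ vis.T).mean(axis=1)
    idx = np.argsort(scores)[::-1][:keep]
    return lang[np.sort(idx)]          # preserve original token order

d, n_vis = 32, 49                      # toy embedding size / 7x7 patches
lang = rng.standard_normal((8, d))     # 8 language tokens from a text encoder
vis = rng.standard_normal((n_vis, d))  # visual patch tokens

# Stepwise decreasing interaction: fewer language tokens survive each stage,
# reducing the influence of inaccurate parts of the description.
for keep in (8, 6, 4, 2):
    lang = feature_filter(lang, vis, keep)
    joint = self_attention(np.concatenate([lang, vis]), d)
    lang, vis = joint[:keep], joint[keep:]

print(lang.shape, vis.shape)           # → (2, 32) (49, 32)
```

Because filtering happens inside the single stream, no extra fusion module is needed; the surviving language tokens simply ride along in the encoder's token sequence, which is consistent with the speed advantage the abstract claims.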
Published in: | IEEE transactions on circuits and systems for video technology, 2024-10, Vol.34 (10), p.9053-9063 |
---|---|
Main Authors: | Zhang, Guangtong; Zhong, Bineng; Liang, Qihua; Mo, Zhiyi; Li, Ning; Song, Shuxiang |
Format: | Article |
Language: | English |
Subjects: | Object tracking; Target tracking; Feature extraction; Natural languages; Vision-language tracking; Visualization |
creator | Zhang, Guangtong Zhong, Bineng Liang, Qihua Mo, Zhiyi Li, Ning Song, Shuxiang |
description | Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature fusion stages. However, this paradigm may degrade the tracking performance due to inaccurate language descriptions and lacks further modal interaction. To address these issues, we propose a one-stream vision-language model called One-stream Stepwise Decreasing for Vision-Language Tracking (OSDT). Specifically, we first encode the language description using a language encoder. The obtained language features are then combined with visual images and entered jointly into a visual encoder, in which the encoder's self-attention mechanism is utilized to facilitate more interactions between language and visual features. Moreover, to mitigate the problems caused by inaccurate language descriptions, we design a stepwise decreasing multi-modal interaction framework, in which a Feature Filter Module (FFM) is introduced to select language features that are more relevant to visual information to provide semantic guidance for visual feature extraction. Furthermore, without additional feature fusion modules, our one-stream model framework can efficiently utilize the proposed feature filtering module for feature selection. Consequently, our tracker can achieve fast tracking speed in the vision-language tracking domain compared to existing state-of-the-art methods. We extensively evaluate our tracker on three benchmarks, i.e. TNL2K, LaSOT, and OTB99, demonstrating competing performance compared to state-of-the-art vision-language tracking methods. |
doi_str_mv | 10.1109/TCSVT.2024.3395352 |
format | article |
identifier | ISSN: 1051-8215 |
issn | 1051-8215; 1558-2205 |
language | eng |
source | IEEE Electronic Library (IEL) Journals |
subjects | Automobiles; Coders; Descriptions; Feature extraction; Information filters; Language; Modules; Natural languages; Object tracking; Roads; Target tracking; Tracking; Vision; vision-language tracking; Visualization |
title | One-Stream Stepwise Decreasing for Vision-Language Tracking |