
Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-02, Vol. 34 (2), p. 924-937
Main Authors: Wu, Yanan; Feng, Songhe; Zhao, Gongpei; Jin, Yi
Format: Article
Language: English
Subjects: attention mechanism; Computational modeling; Computer vision; Correlation; Datasets; Graph matching; Image classification; Labels; Multi-label image classification; Semantics; Task analysis; transformer; Transformers; Visualization
ISSN: 1051-8215
EISSN: 1558-2205
DOI: 10.1109/TCSVT.2023.3288205
Publisher: IEEE, New York
Source: IEEE Xplore (Online service)
Description: Graph Matching has recently emerged as an attractive technique applied to various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such a learned model cannot be well guaranteed, owing to its manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, we propose a novel Transformer driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making the model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space respectively, and then compute the hidden representation of each node within its own space by applying a self-attention strategy over its entire neighborhood. Subsequently, cross-attention is adopted to excavate the correct assignments between instances and labels, and further interprets how classifying each label depends on the instances within an image and on its interactions with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of the proposed method.
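For readers who want a concrete picture of the pipeline the abstract describes (self-attention within the visual and label spaces, followed by cross-attention whose weights score instance-label assignments), below is a minimal PyTorch sketch. Every module name, dimension, and the dot-product readout here is an illustrative assumption, not the authors' exact architecture; the actual design is in the paper at the DOI above.

import torch
import torch.nn as nn

class MatchingSelectionSketch(nn.Module):
    """Self-attention within each space, cross-attention across spaces."""

    def __init__(self, dim=256, heads=4, num_labels=80):
        super().__init__()
        # One learnable node per class in the label space (assumed init).
        self.label_nodes = nn.Parameter(torch.randn(num_labels, dim))
        # Self-attention: instance nodes attend to instance nodes and
        # label nodes attend to label nodes, each within its own space.
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.label_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: labels query instances to uncover assignments.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instances):
        # instances: (B, N, dim) instance features for a batch of images.
        b = instances.size(0)
        labels = self.label_nodes.unsqueeze(0).expand(b, -1, -1)  # (B, C, dim)
        # Hidden representation of each node within its own space.
        v, _ = self.visual_attn(instances, instances, instances)
        l, _ = self.label_attn(labels, labels, labels)
        # Cross-attention weights act as soft instance-label assignments.
        l2, assign = self.cross_attn(l, v, v)   # assign: (B, C, N)
        # Assumed readout: per-label max over instance similarities -> (B, C).
        logits = torch.einsum("bcd,bnd->bcn", l2, v).max(dim=-1).values
        return logits, assign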
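The loss in the final step can likewise be sketched. The formulation below follows the common asymmetric focal loss pattern for multi-label classification (harsher focusing and probability clipping on negatives than on positives); it is a plausible stand-in, not necessarily the exact loss designed in the paper.

import torch

def asymmetric_focal_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    # logits, targets: (B, C); targets are multi-hot in {0, 1}.
    p = torch.sigmoid(logits)
    # Shift negatives down so very easy negatives contribute no loss.
    p_neg = (p - clip).clamp(min=0)
    # Separate focusing exponents for positive and negative terms.
    pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
    return -(pos + neg).mean()

With the sketch above, training would pair logits, assign = model(features) with loss = asymmetric_focal_loss(logits, y) for multi-hot targets y.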