
Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-02, Vol. 34 (2), p. 924-937
Main Authors: Wu, Yanan; Feng, Songhe; Zhao, Gongpei; Jin, Yi
Format: Article
Language: English
Subjects: attention mechanism; Computational modeling; Computer vision; Correlation; Datasets; Graph matching; Image classification; Labels; Multi-label image classification; Semantics; Task analysis; transformer; Transformers; Visualization
ISSN: 1051-8215
EISSN: 1558-2205
DOI: 10.1109/TCSVT.2023.3288205
Publisher: IEEE, New York
Source: IEEE Xplore (Online service)
Description: Graph Matching has recently emerged as an attractive technique applied to various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such a learned model cannot be well guaranteed, owing to its manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, we propose a novel Transformer driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making the model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space respectively, and then compute the hidden representation of each node within its own space by applying a self-attention strategy over its entire neighborhood. Subsequently, cross-attention is adopted to excavate the correct assignments between instances and labels, and further interprets how classifying each label depends on the instances within an image and on its interactions with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of the proposed method.
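For readers who want a concrete picture of the pipeline the abstract describes (self-attention within the visual and label spaces, followed by cross-attention whose weights score instance-label assignments), below is a minimal PyTorch sketch. Every module name, dimension, and the dot-product readout here is an illustrative assumption, not the authors' exact architecture; the actual design is in the paper at the DOI above.

import torch
import torch.nn as nn

class MatchingSelectionSketch(nn.Module):
    """Self-attention within each space, cross-attention across spaces."""

    def __init__(self, dim=256, heads=4, num_labels=80):
        super().__init__()
        # One learnable node per class in the label space (assumed init).
        self.label_nodes = nn.Parameter(torch.randn(num_labels, dim))
        # Self-attention: instance nodes attend to instance nodes and
        # label nodes attend to label nodes, each within its own space.
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.label_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: labels query instances to uncover assignments.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instances):
        # instances: (B, N, dim) instance features for a batch of images.
        b = instances.size(0)
        labels = self.label_nodes.unsqueeze(0).expand(b, -1, -1)  # (B, C, dim)
        # Hidden representation of each node within its own space.
        v, _ = self.visual_attn(instances, instances, instances)
        l, _ = self.label_attn(labels, labels, labels)
        # Cross-attention weights act as soft instance-label assignments.
        l2, assign = self.cross_attn(l, v, v)   # assign: (B, C, N)
        # Assumed readout: per-label max over instance similarities -> (B, C).
        logits = torch.einsum("bcd,bnd->bcn", l2, v).max(dim=-1).values
        return logits, assign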
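The loss in the final step can likewise be sketched. The formulation below follows the common asymmetric focal loss pattern for multi-label classification (harsher focusing and probability clipping on negatives than on positives); it is a plausible stand-in, not necessarily the exact loss designed in the paper.

import torch

def asymmetric_focal_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    # logits, targets: (B, C); targets are multi-hot in {0, 1}.
    p = torch.sigmoid(logits)
    # Shift negatives down so very easy negatives contribute no loss.
    p_neg = (p - clip).clamp(min=0)
    # Separate focusing exponents for positive and negative terms.
    pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
    return -(pos + neg).mean()

With the sketch above, training would pair logits, assign = model(features) with loss = asymmetric_focal_loss(logits, y) for multi-hot targets y.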