Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification
Graph Matching has recently emerged as an attractive technique applied to various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such learned models cannot be well guaranteed, due to their manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, in this work, we propose a novel Transformer Driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making our model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space respectively, and then compute the hidden representation of each node in its individual space by applying a self-attention strategy over its entire neighborhood. Subsequently, cross-attention is adopted to excavate the correct assignments between instances and labels, and further interprets how classifying each label depends on the instances within an image and on its interaction with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of our proposed method.
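The cross-attention matching step the abstract describes, in which label nodes attend over instance nodes to score candidate instance-label assignments, can be sketched as follows. This is a minimal single-head illustration under assumed shapes, not the authors' implementation; the function names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_scores(instances, labels):
    """Score every instance-label assignment with scaled dot-product
    cross-attention: label nodes act as queries over instance keys."""
    d_k = instances.shape[-1]
    logits = labels @ instances.T / np.sqrt(d_k)  # (num_labels, num_instances)
    return softmax(logits, axis=-1)               # each row sums to 1

rng = np.random.default_rng(0)
inst = rng.normal(size=(5, 16))  # 5 instance nodes in a 16-dim visual space
lab = rng.normal(size=(3, 16))   # 3 label nodes projected to the same dim
A = cross_attention_scores(inst, lab)
print(A.shape)        # (3, 5): one attention row per label
print(A.sum(axis=1))  # rows are normalized over instances
```

In the paper this assignment matrix would be produced inside a full Transformer block (with learned query/key/value projections and multiple heads); the sketch keeps only the scoring core.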
| Published in: | IEEE Transactions on Circuits and Systems for Video Technology, 2024-02, Vol. 34 (2), p. 924-937 |
|---|---|
| Main Authors: | Wu, Yanan; Feng, Songhe; Zhao, Gongpei; Jin, Yi |
| Format: | Article |
| Language: | English |
| Subjects: | Attention mechanism; Graph matching; Image classification; Multi-label image classification; Transformers |
container_end_page | 937 |
container_issue | 2 |
container_start_page | 924 |
container_title | IEEE transactions on circuits and systems for video technology |
container_volume | 34 |
creator | Wu, Yanan Feng, Songhe Zhao, Gongpei Jin, Yi |
description | Graph Matching has recently emerged as an attractive technique applied to various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such learned models cannot be well guaranteed, due to their manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, in this work, we propose a novel Transformer Driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making our model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space respectively, and then compute the hidden representation of each node in its individual space by applying a self-attention strategy over its entire neighborhood. Subsequently, cross-attention is adopted to excavate the correct assignments between instances and labels, and further interprets how classifying each label depends on the instances within an image and on its interaction with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of our proposed method. |
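The abstract's "asymmetric focal loss" is not spelled out in this record. One common formulation (in the style of Ridnik et al.'s ASL, which the paper's variant may differ from) uses separate focusing exponents for positive and negative labels and shifts negative probabilities to discard easy negatives; the hyperparameter values below are illustrative, not the paper's.

```python
import numpy as np

def asymmetric_focal_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                          clip=0.05, eps=1e-8):
    """Asymmetric focal loss sketch for multi-label classification.
    `logits` and `targets` are (batch, num_labels); targets are 0/1."""
    p = 1.0 / (1.0 + np.exp(-logits))        # per-label sigmoid probability
    p_neg = np.clip(p - clip, 0.0, 1.0)      # probability shifting for negatives
    loss_pos = targets * (1 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1 - targets) * p_neg ** gamma_neg * np.log(1 - p_neg + eps)
    return float(-(loss_pos + loss_neg).mean())

logits = np.array([[2.0, -1.0, 0.5]])
targets = np.array([[1.0, 0.0, 1.0]])
loss = asymmetric_focal_loss(logits, targets)
print(loss)  # a positive scalar; easy negatives contribute almost nothing
```

With `gamma_neg > gamma_pos`, confident negatives are down-weighted much more aggressively than positives, which is the usual motivation in multi-label settings where negatives vastly outnumber positives.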
doi_str_mv | 10.1109/TCSVT.2023.3288205 |
format | article |
fulltext | fulltext |
identifier | ISSN: 1051-8215 |
ispartof | IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.924-937 |
issn | 1051-8215 1558-2205 |
language | eng |
recordid | cdi_proquest_journals_2923122937 |
source | IEEE Xplore (Online service) |
subjects | attention mechanism; Computational modeling; Computer vision; Correlation; Datasets; Graph matching; Image classification; Labels; Multi-label image classification; Semantics; Task analysis; transformer; Transformers; Visualization
title | Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification |