Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification

Bibliographic Details
Published in:IEEE Transactions on Circuits and Systems for Video Technology, 2024-02, Vol. 34 (2), p. 924-937
Main Authors: Wu, Yanan, Feng, Songhe, Zhao, Gongpei, Jin, Yi
Format: Article
Language:English
Description
Summary:Graph Matching has recently emerged as an attractive technique for various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such a learned model cannot be well guaranteed, owing to its manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, in this work we propose a novel Transformer Driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making our model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space, respectively, and compute the hidden representation of each node in its own space by applying self-attention over its entire neighborhood. Cross-attention is then adopted to uncover the correct assignments between instances and labels, and further interprets how classifying each label depends on the instances within an image and on its interactions with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of our proposed method.
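The pipeline described in the abstract (self-attention within the visual and label spaces, cross-attention for instance-label assignment, and an asymmetric focal loss over the matching scores) can be sketched in a few lines. The PyTorch code below is a minimal illustrative sketch, not the authors' released implementation: the module name TMSHead, the feature dimension, and the loss hyper-parameters gamma_pos, gamma_neg, and clip are all assumptions chosen for exposition.

    # Illustrative sketch of the matching-selection idea; names and
    # hyper-parameters here are assumptions, not the paper's settings.
    import torch
    import torch.nn as nn

    class TMSHead(nn.Module):
        def __init__(self, dim=256, num_labels=80, num_heads=4):
            super().__init__()
            # One learnable label embedding per class: the label-space nodes.
            self.label_emb = nn.Parameter(torch.randn(num_labels, dim))
            # Self-attention within each space (instance nodes, label nodes).
            self.inst_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.label_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Cross-attention: labels query the instance bag, so the attention
            # weights act as soft instance-label assignments.
            self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.score = nn.Linear(dim, 1)

        def forward(self, instances):
            # instances: (batch, num_instances, dim) bag of instance features.
            b = instances.size(0)
            inst, _ = self.inst_self(instances, instances, instances)
            lab = self.label_emb.unsqueeze(0).expand(b, -1, -1)
            lab, _ = self.label_self(lab, lab, lab)
            matched, attn = self.cross(lab, inst, inst)
            logits = self.score(matched).squeeze(-1)  # (batch, num_labels)
            return logits, attn

    def asymmetric_focal_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
        # An asymmetric focal loss in the spirit the abstract describes:
        # negatives are down-weighted more aggressively than positives, with
        # probability shifting on the negative side. Values are common
        # defaults, not the paper's reported ones.
        p = torch.sigmoid(logits)
        p_neg = (p - clip).clamp(min=0)
        loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
        loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
        return -(loss_pos + loss_neg).mean()

Given CNN-extracted instance features of shape (batch, num_instances, dim), calling logits, attn = TMSHead()(features) returns one confidence per class, and attn can be read as the soft instance-label assignment matrix that the matching-selection formulation optimizes.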
ISSN:1051-8215
1558-2205
DOI:10.1109/TCSVT.2023.3288205