Distance Restricted Transformer Encoder for Multi-Label Classification

Bibliographic Details
Main Authors: Wang, Xiaomei; Li, Yaqian; Luo, Tong; Guo, Yandong; Fu, Yanwei; Xue, Xiangyang
Format: Conference Proceeding
Language: English
Description
Summary: Multi-label image classification is a fundamental but challenging task in the multimedia community. It aims to predict the set of labels present in an image. Considerable progress has recently been made by training convolutional neural networks with a binary cross-entropy loss. However, conventional approaches are limited in their ability to highlight the key visual content associated with the target labels, and they pay little attention to constraining the distances between visual representations and positive/negative label representations. To address these issues, we first introduce a variant of the transformer encoder for capturing the underlying, crucial visual information related to the ground-truth labels. Specifically, a novel primal feature guided net is designed to preserve the original visual features during the encoding process. Second, we exploit a distance restricted learning strategy in a common semantic space that shrinks the distances between images and their positive labels while expanding the distances to negative ones during the training stage. Extensive experiments on the MSCOCO and WIDER Attribute datasets show outstanding performance compared with other state-of-the-art models.
ISSN: 1945-788X
DOI: 10.1109/ICME51207.2021.9428164
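
The summary describes shrinking the distance between an image and its positive labels while expanding the distance to negative labels in a common semantic space. The record does not give the paper's actual loss, so the following is only a minimal sketch of one generic way such an objective could look: a hinge-style distance loss. The function name `distance_restricted_loss`, the Euclidean metric, and the `margin` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def distance_restricted_loss(image_emb, label_embs, targets, margin=1.0):
    """Hypothetical hinge-style distance loss (an illustration, not the paper's loss).

    image_emb  : list of floats, the image embedding in the shared space
    label_embs : list of label embeddings, each the same length as image_emb
    targets    : list of 0/1 flags, 1 if the label is present in the image
    margin     : assumed minimum distance wanted between image and negative labels
    """
    loss = 0.0
    for label_emb, is_positive in zip(label_embs, targets):
        # Euclidean distance between the image and this label embedding
        d = math.dist(image_emb, label_emb)
        if is_positive:
            # Positive label: any remaining distance is penalized (pull closer)
            loss += d
        else:
            # Negative label: penalized only while it sits inside the margin (push away)
            loss += max(0.0, margin - d)
    return loss


# Toy usage: one positive label far from the image, two negatives nearby
image = [0.0, 0.0]
labels = [[3.0, 4.0], [0.6, 0.8], [0.0, 2.0]]
present = [1, 0, 0]
print(distance_restricted_loss(image, labels, present, margin=1.5))
```

In this sketch a negative label already farther away than `margin` contributes nothing, so only negatives that lie too close to the image are pushed; whether the paper uses a margin at all is not stated in the summary.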