Distance Restricted Transformer Encoder for Multi-Label Classification
Main Authors: | , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | Multi-label image classification is a fundamental but challenging task in the multimedia community. It aims to predict the set of labels present in an image. Recent work has made great progress by training convolutional neural networks with a binary cross-entropy loss. However, conventional approaches are limited in their ability to highlight the key visual content associated with the target labels, and they pay little attention to constraining the distances between visual representations and positive/negative label representations. To address these issues, we first introduce a variant of the transformer encoder that captures the underlying, crucial visual information related to the ground-truth labels. Specifically, a novel primal feature guided net is designed to preserve the original visual features during encoding. Second, we exploit a distance restricted learning strategy in a common semantic space that shrinks the distances between images and their positive labels while expanding the distances to negative labels during training. Extensive experiments on the MSCOCO and WIDER Attribute datasets show outstanding performance compared with other state-of-the-art models. |
---|---|
ISSN: | 1945-788X |
DOI: | 10.1109/ICME51207.2021.9428164 |