Exploring Vision-Language Models for Imbalanced Learning

Bibliographic Details
Published in:	International journal of computer vision 2024, Vol.132 (1), p.224-237
Main Authors:	Wang, Yidong; Yu, Zhuohao; Wang, Jindong; Heng, Qiang; Chen, Hao; Ye, Wei; Xie, Rui; Xie, Xing; Zhang, Shikun
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Classification; Computer Imaging; Computer Science; Computer vision; Cost analysis; Datasets; Image Processing and Computer Vision; Machine learning; Pattern Recognition; Pattern Recognition and Graphics; Performance prediction; Special Issue on The Promises and Dangers of Large Vision Models; Vision

Description
Summary:	Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, they perform relatively poorly on imbalanced datasets, where the distribution of classes in the training data is skewed and minority classes are predicted poorly. For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset. We propose adding a lightweight decoder to VLMs to avoid the out-of-memory problems caused by a large number of classes and to capture nuanced features for tail classes. We then explore improvements to VLMs using prompt tuning, fine-tuning, and imbalanced learning algorithms such as Focal Loss, Balanced SoftMax, and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when they are combined with the decoder and imbalanced learning methods. Specifically, our improved VLMs significantly outperform zero-shot classification, by an average of 6.58%, 69.82%, and 6.17% in accuracy on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced learning algorithms even for VLMs pre-trained on huge amounts of data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.
ISSN:	0920-5691 1573-1405
DOI:	10.1007/s11263-023-01868-w
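
Illustrative note: the summary names Focal Loss and Balanced SoftMax among the imbalanced learning algorithms explored. The sketch below shows the commonly cited form of each loss for a single example, using only the Python standard library. It is not taken from the authors' released code; the function names, the default `gamma`, and the sample logits and class counts are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits, target):
    # Standard cross-entropy for one example with an integer class label.
    return -log_softmax(logits)[target]

def focal_loss(logits, target, gamma=2.0):
    # Focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    # well-classified examples contribute less to the total loss.
    log_p = log_softmax(logits)[target]
    p = math.exp(log_p)
    return -((1.0 - p) ** gamma) * log_p

def balanced_softmax_loss(logits, target, class_counts):
    # Balanced Softmax: add log(class frequency) to each logit before
    # the softmax, which compensates for the skewed label prior.
    adjusted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    return cross_entropy(adjusted, target)

# Hypothetical example: three classes, the third one is a rare tail class.
logits = [2.0, 1.0, 0.5]
class_counts = [1000, 100, 10]
tail = 2

ce = cross_entropy(logits, tail)
fl = focal_loss(logits, tail)
bs = balanced_softmax_loss(logits, tail, class_counts)
```

With these sample values, the Balanced Softmax loss on the tail class is larger than plain cross-entropy, because the rare class receives the smallest prior adjustment, so during training the model has to raise the tail-class logit further to reduce the loss. The focal loss is never larger than cross-entropy, since its scaling factor lies between 0 and 1.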