Integrating Language Guidance into Image-Text Matching for Correcting False Negatives
Published in: IEEE Transactions on Multimedia, 2024-01, Vol. 26, pp. 1-14
Main Authors: , , , ,
Format: Article
Language: English
Summary: Image-Text Matching (ITM) aims to establish the correspondence between images and sentences and is fundamental to a range of vision-and-language understanding tasks. However, existing ITM benchmarks have a limitation rooted in how they are constructed: because images and sentences are collected in pairs, only the samples paired at collection time are annotated as positive, and all other samples are annotated as negative. Many genuine correspondences are therefore missed. For example, a sentence is matched with only one image at collection time, so only that image is annotated as positive for the sentence; yet among the images annotated as negative, some may actually correspond to the sentence. Such mislabeled samples are called false negatives. Existing ITM models are optimized on annotations containing these mislabels, which introduces noise during training. In this paper, we propose an ITM framework that integrates Language Guidance (LG) to correct false negatives. A language pre-training model is introduced into the ITM framework to identify false negatives, and a language guidance loss adaptively corrects their locations in the visual-semantic embedding space. Extensive experiments on two ITM benchmarks show that our method improves the performance of existing ITM models. To verify the correction of false negatives, we conduct further experiments on ECCV Caption, a dataset in which false negatives in the annotations have been verified and corrected; the results show that our method recalls more relevant false negatives. The code is available at https://github.com/AAA-Zheng/LG_ITM .
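The language-guidance idea described in the summary — using a pretrained language model's sentence similarity to soften the penalty on "negatives" that are likely false negatives — can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the paper's actual loss: the function name `lg_triplet_loss`, the similarity threshold, and the margin-scaling rule are hypothetical, and the language-model similarity score is taken as a precomputed input.

```python
# Hypothetical sketch of a language-guided hinge (triplet) loss for ITM.
# All names and constants here are illustrative assumptions, not from the paper.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def lg_triplet_loss(img, pos_txt, neg_txt, neg_lang_sim,
                    margin=0.2, threshold=0.8):
    """Hinge triplet loss whose margin is adaptively shrunk when the
    'negative' caption is semantically close to the positive caption,
    i.e. a likely false negative as judged by a language model.

    neg_lang_sim: similarity between the positive and negative captions,
    assumed to come from a pretrained language model (precomputed here).
    """
    if neg_lang_sim >= threshold:
        # Likely false negative: do not push it a full margin away.
        eff_margin = margin * (1.0 - neg_lang_sim)
    else:
        eff_margin = margin
    s_pos = cosine(img, pos_txt)  # image vs. annotated-positive caption
    s_neg = cosine(img, neg_txt)  # image vs. annotated-negative caption
    return max(0.0, eff_margin - s_pos + s_neg)
```

For the same image and captions, a high language-model similarity (likely false negative) yields a smaller loss than a low one, so the embedding of a mislabeled caption is pulled away from the image less aggressively — the qualitative behavior the summary attributes to the language guidance loss.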
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3261443