Exploring Vision-Language Foundation Model for Novel Object Captioning
Published in: | IEEE transactions on circuits and systems for video technology 2024-08, p.1-1 |
---|---|
Main Authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | It is widely believed that pre-trained vision-language foundation models (e.g., CLIP) substantially facilitate vision-language tasks. Nevertheless, there has been little evidence supporting this belief for the task of describing novel objects in images. In this paper, we propose the Novel Object Transformer with CLIP (NOTC), a Transformer-based model that exploits the powerful vision-language representation ability of CLIP to enhance both the training and the sentence decoding processes of a novel object captioning model. Technically, given the primary bag-of-objects extracted by Faster R-CNN, NOTC first capitalizes on an object distiller module to emphasize the most salient objects and infer the missing novel ones. The refined object words are additionally fed into the object-centric word predictor to generate the sentence word by word. During training, we design a CLIP-based self-critical sequence training paradigm that selects visually grounded sampled sentences with higher CLIP score rewards, which enables joint training of the captioning model over out-domain training images with novel objects. Moreover, at inference, a new CLIP beam search algorithm is devised to enforce the existence of novel objects and encourage the partial word sequences with higher CLIP scores, thereby decoding sentences that are both visually grounded and comprehensive. Extensive experiments are conducted on the held-out COCO and nocaps datasets, and competitive performance is reported in comparison with state-of-the-art approaches. |
---|---|
ISSN: | 1051-8215 1558-2205 |
DOI: | 10.1109/TCSVT.2024.3452437 |
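
The summary mentions a CLIP-based self-critical sequence training paradigm but gives no formulation. As a rough illustration only, the sketch below shows the generic self-critical loss for a single sampled caption, where the reward of a greedy baseline caption is subtracted from the reward of the sampled caption. It is not the paper's implementation: `scst_loss` and `toy_clip_score` are hypothetical names, and `toy_clip_score` is a simple word-overlap stand-in for a real CLIP image-text similarity score.

```python
from typing import Sequence, Set


def toy_clip_score(caption: str, image_objects: Set[str]) -> float:
    """Hypothetical stand-in for a CLIP image-text score.

    Returns the fraction of the image's objects that the caption mentions.
    A real system would compare CLIP image and text embeddings instead.
    """
    words = set(caption.lower().split())
    return len(words & image_objects) / len(image_objects)


def scst_loss(
    sample_logprobs: Sequence[float],
    sample_reward: float,
    baseline_reward: float,
) -> float:
    """Generic self-critical loss for one sampled caption.

    advantage = reward(sampled caption) - reward(greedy baseline caption)
    loss      = -advantage * sum of token log-probabilities of the sample
    Minimizing the loss raises the probability of samples that beat the baseline.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_logprobs)


# Assumed toy data: the image contains a novel object ("zebra") and "grass".
image_objects = {"zebra", "grass"}
sampled_caption = "a zebra on grass"   # mentions the novel object
greedy_caption = "a horse on grass"    # baseline misses the novel object

sample_reward = toy_clip_score(sampled_caption, image_objects)
baseline_reward = toy_clip_score(greedy_caption, image_objects)
loss = scst_loss([-0.5, -1.0, -0.25, -0.25], sample_reward, baseline_reward)
```

In this toy case the sampled caption earns a higher reward than the baseline, so the advantage is positive and gradient descent on the loss would increase the log-probabilities of the sampled tokens.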