
Region-Aware Image Captioning via Interaction Learning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-06, Vol. 32 (6), p. 3685-3696
Main Authors: Liu, An-An, Zhai, Yingchen, Xu, Ning, Nie, Weizhi, Li, Wenhui, Zhang, Yongdong
Format: Article
Language:English
Description
Summary: Image captioning is a core task in computer vision that aims to automatically generate natural-language descriptions of images. Intuitively, the human visual system notices salient regions at first glance and then volitionally focuses on objects of interest within those regions. For example, to generate a free-form sentence about "boy-catch-baseball", the visual region involving "boy" and "baseball" could be attended to first and then guide salient-object discovery during word-by-word generation. Previous captioning works mainly rely on object-wise modeling and ignore rich regional patterns. To mitigate this drawback, this paper proposes a region-aware interaction learning method that explicitly captures semantic correlations along the region and object dimensions for word inference. First, given an image, we extract a set of regions that contain diverse objects and their relations. Second, we present a spatial-GCN interaction refining structure that establishes connections between regions and objects to effectively capture contextual information. Third, we design a dual-attention interaction inference procedure that computes attention jointly over the region and object dimensions for word generation. Specifically, a guidance mechanism is proposed to selectively emphasize semantic inter-dependencies from the region attention to the object attention. Extensive experiments on the MSCOCO dataset demonstrate the superiority of the proposed method, and ablation studies and visualizations further validate its effectiveness.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2021.3107035
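
Illustrative sketch: the record does not include code, so the following PyTorch-style snippet is only a rough illustration of the region-to-object guidance idea described in the summary (region attention producing a context vector that then conditions object attention). All module names, feature dimensions, and scoring functions below are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of dual attention with region-to-object guidance.
# Assumed shapes: region/object features of size feat_dim, decoder state of size hidden_dim.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionWithGuidance(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # score each region from [region feature; decoder state]
        self.region_att = nn.Linear(feat_dim + hidden_dim, 1)
        # score each object from [object feature; decoder state; region context]
        self.object_att = nn.Linear(feat_dim + hidden_dim + feat_dim, 1)

    def forward(self, regions, objects, h):
        # regions: (B, R, feat_dim), objects: (B, O, feat_dim), h: (B, hidden_dim)
        B, R, _ = regions.shape
        h_r = h.unsqueeze(1).expand(B, R, h.size(-1))
        region_scores = self.region_att(torch.cat([regions, h_r], dim=-1)).squeeze(-1)  # (B, R)
        region_weights = F.softmax(region_scores, dim=1)
        region_ctx = torch.bmm(region_weights.unsqueeze(1), regions).squeeze(1)         # (B, feat_dim)

        O = objects.size(1)
        h_o = h.unsqueeze(1).expand(B, O, h.size(-1))
        # the attended region context "guides" the object attention scores
        guide = region_ctx.unsqueeze(1).expand(B, O, region_ctx.size(-1))
        object_scores = self.object_att(torch.cat([objects, h_o, guide], dim=-1)).squeeze(-1)  # (B, O)
        object_weights = F.softmax(object_scores, dim=1)
        object_ctx = torch.bmm(object_weights.unsqueeze(1), objects).squeeze(1)          # (B, feat_dim)
        return region_ctx, object_ctx

# Example usage with random features (batch of 2 images, 6 regions, 36 objects):
# regions = torch.randn(2, 6, 2048); objects = torch.randn(2, 36, 2048); h = torch.randn(2, 512)
# region_ctx, object_ctx = DualAttentionWithGuidance()(regions, objects, h)

In an actual captioning decoder, the two context vectors would feed the word-prediction step at each time step; the paper's spatial-GCN refinement of region and object features precedes this stage and is not sketched here.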