Loading…

Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and...

Full description

Saved in:

Bibliographic Details
Published in:	International journal of computer vision 2022-08, Vol.130 (8), p.1920-1937
Main Authors:	Guo, Dandan, Lu, Ruiying, Chen, Bo, Zeng, Zequn, Zhou, Mingyuan
Format:	Article
Language:	English
Subjects:	Artificial Intelligence Computational linguistics Computer Imaging Computer Science Image Processing and Computer Vision Language processing Multilayers Natural language interfaces Pattern Recognition Pattern Recognition and Graphics Semantics Vision Visual observation
Citations:	Items that this one cites Items that cite this one
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between the image and text at multiple levels of abstraction and learn the semantic topics from images, we design a variational inference network to build the mapping from image features to textual captions. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory and Transformer, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches in terms of standard evaluation metrics, can be used to both distill interpretable multi-layer semantic topics and generate diverse and coherent captions.
ISSN:	0920-5691 1573-1405
DOI:	10.1007/s11263-022-01624-6