Research on image captioning using dilated convolution ResNet and attention mechanism

Bibliographic Details
Published in: Multimedia Systems, 2025-02, Vol. 31 (1), Article 47
Main Authors: Li, Haisheng, Yuan, Rongrong, Li, Qiuyi, Hu, Cong
Format: Article
Language:English
Description
Summary: Image captioning, the task of generating a textual description of a given image, is a key problem at the intersection of vision and language. A central challenge is capturing both local and global image features accurately while keeping the model efficient and computational costs low. In this work, we introduce dilated convolution to enlarge the receptive field, allowing the network to better capture an image's details and contextual information and to extract richer image features. A sparse multilayer perceptron, combined with an attention mechanism, strengthens the extraction of fine-grained features and the focus on essential feature regions, improving the network's expressive power and feature selection. In addition, a residual squeeze-and-excitation module helps the model better understand image content, further improving captioning accuracy. Experimental results on the Flickr8k and Flickr30k datasets show that the proposed method improves both the accuracy and the diversity of generated captions.
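To illustrate the key idea behind dilated convolution mentioned in the abstract (this is a generic sketch, not code from the paper), the following pure-Python example applies a 1D dilated kernel and computes how the receptive field of a stack of dilated layers grows without adding parameters. The function names and the 1D setting are illustrative assumptions; the paper works with 2D convolutions in a ResNet backbone.

```python
def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1D cross-correlation with a dilated kernel (illustrative sketch).

    With dilation d, kernel tap j reads input position i + j*d, so the same
    3-tap kernel spans (k-1)*d + 1 input positions.
    """
    k = len(w)
    span = (k - 1) * dilation + 1  # effective kernel width after dilation
    return [
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = list(range(10))
w = [1, 0, -1]  # a simple difference kernel

print(dilated_conv1d(x, w, dilation=1))  # differences two apart: [-2]*8
print(dilated_conv1d(x, w, dilation=2))  # same kernel, wider span: [-4]*6
print(receptive_field(3, [1, 2, 4]))     # 15, vs 7 for dilations [1, 1, 1]
```

With exponentially increasing dilations (1, 2, 4), three 3-tap layers see 15 input positions instead of 7, which is the "larger receptive field at no extra parameter cost" trade-off the abstract appeals to.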
ISSN: 0942-4962
1432-1882
DOI: 10.1007/s00530-024-01653-w