Research on image captioning using dilated convolution ResNet and attention mechanism
Published in: Multimedia Systems, 2025-02, Vol. 31 (1), Article 47
Main Authors: Li, Haisheng; Yuan, Rongrong; Li, Qiuyi; Hu, Cong
Format: Article
Language: English
Subjects: Accuracy; Attention; Computer Communication Networks; Computer Graphics; Computer Science; Convolution; Cryptology; Data Storage Representation; Feature extraction; Multilayer perceptrons; Multimedia Information Systems; Operating Systems; Regular Paper; Visual tasks
ISSN: 0942-4962
EISSN: 1432-1882
DOI: 10.1007/s00530-024-01653-w
Publisher: Springer Berlin Heidelberg (Berlin/Heidelberg)
Abstract:
Image captioning, the task of generating a textual description of a given image's content, is recognized as a key problem in visual-to-linguistic tasks. The main challenge is to capture both local and global image features accurately at the same time while maintaining model efficiency and keeping computational costs low. In this work, we introduce dilated convolution to enlarge the receptive field, so the network better captures an image's details and contextual information and extracts richer image features. A sparse multilayer perceptron is introduced and combined with an attention mechanism to enhance the extraction of detailed features and the attention paid to essential feature regions, improving the network's expressive ability and feature selection. Furthermore, a residual squeeze-and-excitation module is added to help the model better understand the image content, further improving captioning accuracy. Experimental results on the Flickr8k and Flickr30k datasets show that the proposed method better captures image features and improves both the accuracy and the diversity of the generated captions.
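This record carries no code, but the abstract names three standard building blocks: dilated convolution inside a ResNet backbone, a squeeze-and-excitation (channel attention) module, and a residual connection. Below is a minimal PyTorch sketch of how such a block is typically assembled; the class names, dilation rate, and reduction ratio are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: the paper's exact layer counts, dilation rates,
# and SE reduction ratio are not given in this record, so the values below
# are conventional defaults, not the authors' configuration.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(                 # excitation MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # channel-wise rescaling


class DilatedSEResBlock(nn.Module):
    """ResNet-style block whose 3x3 convolutions are dilated to enlarge
    the receptive field, followed by SE and a residual connection."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel
        return self.relu(x + self.se(self.body(x)))


if __name__ == "__main__":
    block = DilatedSEResBlock(channels=256, dilation=2)
    feats = block(torch.randn(2, 256, 28, 28))
    print(feats.shape)  # torch.Size([2, 256, 28, 28])
```

Stacking blocks like this in place of a ResNet's later stages is the usual way to widen the receptive field without further downsampling; in a captioning pipeline, the resulting encoder feature map would then feed the decoder's attention mechanism.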