
Research on image captioning using dilated convolution ResNet and attention mechanism

Bibliographic Details
Published in: Multimedia systems, 2025-02, Vol. 31 (1), Article 47
Main Authors: Li, Haisheng; Yuan, Rongrong; Li, Qiuyi; Hu, Cong
Format: Article
Language: English
Description: Image captioning, which refers to generating a textual description of the content of a given image, has been recognized as a key problem in visual-to-linguistic tasks. In this work, we introduce dilated convolution to enlarge the receptive field, which better captures an image's details and contextual information and extracts richer image features. A sparse multilayer perceptron is introduced and combined with an attention mechanism to enhance the extraction of detailed features and the focus on essential feature regions, thus improving the network's expressive ability and feature selection. Furthermore, a residual squeeze-and-excitation module is added to help the model better understand the image content, improving the accuracy of the image captioning task. The main challenge is achieving high accuracy in capturing both local and global image features simultaneously while maintaining model efficiency and reducing computational cost. Experimental results on the Flickr8k and Flickr30k datasets show that the proposed method improves generation accuracy and diversity, better capturing image features and raising the quality of generated captions.
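The dilated-convolution idea in the abstract can be illustrated with a minimal 1-D sketch (plain Python, not the authors' implementation): a kernel of size k with dilation d covers a receptive field of (k - 1) * d + 1 input positions while using the same number of weights, which is how the paper's network sees more context at no extra parameter cost.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Valid-mode 1-D convolution with gaps of `dilation` between kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of one output position
    return [
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
# dilation=1 behaves like an ordinary convolution: each output sees 3 inputs
print(dilated_conv1d(signal, [1, 1, 1], dilation=1))  # [6, 9, 12, 15, 18, 21]
# dilation=2 uses the same 3 weights but spans 5 input positions
print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # [9, 12, 15, 18]
```

In the paper's 2-D setting the same principle applies per axis; stacking layers with growing dilation rates expands the receptive field exponentially, which is what lets the backbone capture both local detail and global context.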
DOI: 10.1007/s00530-024-01653-w
Publisher: Springer Berlin Heidelberg (Berlin/Heidelberg)
ISSN: 0942-4962
EISSN: 1432-1882
Source: Springer Nature
Subjects: Accuracy
Attention
Computer Communication Networks
Computer Graphics
Computer Science
Convolution
Cryptology
Data Storage Representation
Feature extraction
Multilayer perceptrons
Multimedia Information Systems
Operating Systems
Regular Paper
Visual tasks