Research on image captioning using dilated convolution ResNet and attention mechanism
Published in: Multimedia Systems, 2025-02, Vol. 31 (1), Article 47
Main Authors: Li, Haisheng; Yuan, Rongrong; Li, Qiuyi; Hu, Cong
Format: Article
Language: English
Subjects: Accuracy; Attention; Computer Communication Networks; Computer Graphics; Computer Science; Convolution; Cryptology; Data Storage Representation; Feature extraction; Multilayer perceptrons; Multimedia Information Systems; Operating Systems; Regular Paper; Visual tasks
ISSN: 0942-4962
EISSN: 1432-1882
DOI: 10.1007/s00530-024-01653-w
Publisher: Springer Berlin Heidelberg (Berlin/Heidelberg)
Abstract:
Image captioning, the task of generating a textual description of a given image's content, is recognized as a key problem in visual-to-linguistic tasks. The main challenge is to capture both local and global image features accurately at the same time while maintaining model efficiency and keeping computational costs low. In this work, we introduce dilated convolution to enlarge the receptive field, so the network better captures an image's details and contextual information and extracts richer image features. A sparse multilayer perceptron is introduced and combined with an attention mechanism to enhance the extraction of detailed features and the attention paid to essential feature regions, improving the network's expressive ability and feature selection. Furthermore, a residual squeeze-and-excitation module is added to help the model better understand the image content, further improving captioning accuracy. Experimental results on the Flickr8k and Flickr30k datasets show that the proposed method better captures image features and improves both the accuracy and the diversity of the generated captions.
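This record carries no code, but the abstract names three standard building blocks: dilated convolution inside a ResNet backbone, a squeeze-and-excitation (channel attention) module, and a residual connection. Below is a minimal PyTorch sketch of how such a block is typically assembled; the class names, dilation rate, and reduction ratio are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: the paper's exact layer counts, dilation rates,
# and SE reduction ratio are not given in this record, so the values below
# are conventional defaults, not the authors' configuration.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(                 # excitation MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # channel-wise rescaling


class DilatedSEResBlock(nn.Module):
    """ResNet-style block whose 3x3 convolutions are dilated to enlarge
    the receptive field, followed by SE and a residual connection."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel
        return self.relu(x + self.se(self.body(x)))


if __name__ == "__main__":
    block = DilatedSEResBlock(channels=256, dilation=2)
    feats = block(torch.randn(2, 256, 28, 28))
    print(feats.shape)  # torch.Size([2, 256, 28, 28])
```

Stacking blocks like this in place of a ResNet's later stages is the usual way to widen the receptive field without further downsampling; in a captioning pipeline, the resulting encoder feature map would then feed the decoder's attention mechanism.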