Caption TLSTMs: combining transformer with LSTMs for image captioning

Image captioning has attracted widespread attention over the years. Recurrent neural networks (RNNs) and their variants have long been the mainstream approach to the image captioning task. However, transformer-based models have shown powerful and promising performance on visual tasks compared with classic neural networks. To extract richer and more robust multimodal intersection feature representations, we improve the original abstract scene graph-to-caption model and propose Caption TLSTMs, a model made up of two LSTMs with Transformer blocks between them. Compared with the model before improvement, the architecture of Caption TLSTMs lets the entire network exploit the long-term dependencies and feature representation ability of the LSTMs, while encoding the multimodal textual, visual and graphic information with the Transformer blocks. Finally, experiments on the Visual Genome and MSCOCO datasets show good performance in improving general image caption generation quality, demonstrating the effectiveness of the Caption TLSTMs model.
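The abstract describes the architecture only at a high level: two LSTMs with Transformer blocks between them. The record contains no code, so the following PyTorch snippet is a minimal sketch of that LSTM → Transformer → LSTM arrangement; every module name, dimension and layer count is an assumption for illustration, not taken from the paper, and the paper's visual and scene-graph inputs are omitted.

```python
import torch
import torch.nn as nn

class CaptionTLSTMsSketch(nn.Module):
    """Hypothetical LSTM -> Transformer blocks -> LSTM stack, per the abstract."""
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # First LSTM: captures long-term dependencies in the token stream.
        self.lstm_in = nn.LSTM(d_model, d_model, batch_first=True)
        # Transformer encoder blocks in the middle: in the paper these fuse
        # multimodal textual, visual and graphic features; here they only
        # see the text features, for brevity.
        block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        # Second LSTM: decodes the fused representation into caption logits.
        self.lstm_out = nn.LSTM(d_model, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.embed(tokens)                 # (batch, seq_len, d_model)
        x, _ = self.lstm_in(x)
        x = self.blocks(x)
        x, _ = self.lstm_out(x)
        return self.proj(x)                    # (batch, seq_len, vocab_size)

# Example: logits for a batch of 4 caption prefixes of 12 tokens each.
model = CaptionTLSTMsSketch()
logits = model(torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```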

Bibliographic Details
Published in: International Journal of Multimedia Information Retrieval, 2022-06, Vol. 11 (2), p. 111-121
Main Authors: Yan, Jie; Xie, Yuxiang; Luan, Xidao; Guo, Yanming; Gong, Quanzhi; Feng, Suru
Format: Article
Language: English
Publisher: Springer London
ISSN: 2192-6611
EISSN: 2192-662X
DOI: 10.1007/s13735-022-00228-7
Subjects: Artificial intelligence; Computer Science; Data Mining and Knowledge Discovery; Database Management; Deep learning; Graphs; Image Processing and Computer Vision; Image quality; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Language; Machine translation; Multimedia Information Systems; Natural language; Neural networks; Recurrent neural networks; Regular Paper; Representations; Semantics; Transformers; Visual tasks