Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

Bibliographic Details
Published in: Electronics (Basel), 2024-06, Vol. 13 (11), p. 2069
Main Authors: Li, Hongchan; Lu, Yantong; Zhu, Haodong
Format: Article
Language: English
Description: Research on uni-modal sentiment analysis has achieved great success, but emotion in real life is mostly multi-modal: it is expressed not only in text but also in images, audio, video, and other forms, and these modalities reinforce one another. If the connections between modalities can be mined, the accuracy of sentiment analysis can be improved further. To this end, this paper introduces MCAM, a cross-attention-based multi-modal fusion model for images and text. For text, the ALBERT pre-trained model extracts text features and a BiLSTM captures textual context; for images, DenseNet121 extracts visual features and CBAM highlights the emotion-related regions. Finally, multi-modal cross-attention fuses the extracted text and image features, and the fused output is classified to determine emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, the model outperforms the baseline models, with accuracy and F1 scores reaching 86.5% and 75.3% on MVSA and 85.5% and 76.7% on TumEmo, respectively. Ablation experiments further confirm that multi-modal fusion outperforms single-modal sentiment analysis.
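
The description sketches the model pipeline (ALBERT + BiLSTM for text, DenseNet121 + CBAM for images, cross-attention fusion, then classification). As a rough illustration only, the PyTorch sketch below wires those pieces together in that order; it is reconstructed from the abstract alone, the CBAM block is omitted, and every class name, dimension, and the classifier head (e.g. CrossAttentionFusion, dim=256, three polarity classes) is an assumption rather than the authors' released code.

```python
# Minimal sketch of an MCAM-style text-image fusion model, based only on the
# abstract above. CBAM is omitted; all dimensions, layer choices, and names
# here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AlbertModel
from torchvision.models import densenet121


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=3):
        super().__init__()
        self.text_encoder = AlbertModel.from_pretrained("albert-base-v2")
        self.bilstm = nn.LSTM(768, dim // 2, batch_first=True, bidirectional=True)
        backbone = densenet121(weights=None)  # pretrained weights would normally be loaded
        self.image_encoder = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1))
        self.image_proj = nn.Linear(1024, dim)
        # Text attends to the image representation and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)  # emotional polarity

    def forward(self, input_ids, attention_mask, image):
        # Token-level text features from ALBERT, contextualised by the BiLSTM.
        tokens = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        text, _ = self.bilstm(tokens)                # (B, L, dim)
        # Global image feature from DenseNet121, projected to the shared dimension.
        img = self.image_encoder(image).flatten(1)   # (B, 1024)
        img = self.image_proj(img).unsqueeze(1)      # (B, 1, dim)
        # Cross-attention in both directions, then concatenate and classify.
        t_attended, _ = self.text_to_image(text, img, img)
        i_attended, _ = self.image_to_text(img, text, text)
        fused = torch.cat([t_attended.mean(dim=1), i_attended.squeeze(1)], dim=-1)
        return self.classifier(fused)
```

In the paper itself, CBAM would refine the DenseNet feature maps before pooling and the model would be trained on MVSA and TumEmo labels; those details are not recoverable from this record.
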
DOI: 10.3390/electronics13112069
ISSN: 2079-9292
EISSN: 2079-9292
Source: Publicly Available Content (ProQuest)
Subjects:
Ablation
Classification
Computational linguistics
Data mining
Deep learning
Dictionaries
Emotions
Esports
Image classification
Language processing
Machine learning
Natural language interfaces
Neural networks
Product reviews
Research methodology
Semantics
Sentiment analysis
Social networks