Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

Bibliographic Details
Published in: Electronics (Basel), 2024-06, Vol. 13 (11), p. 2069
Main Authors: Li, Hongchan; Lu, Yantong; Zhu, Haodong
Format: Article
Language: English
Description: Research on uni-modal sentiment analysis has achieved great success, but emotion in real life is mostly multi-modal: it is expressed not only in text but also in images, audio, video, and other forms, and these modalities reinforce one another. If the connections between modalities can be mined, the accuracy of sentiment analysis can be improved further. To this end, this paper introduces MCAM, a cross-attention-based multi-modal fusion model for images and text. For text, the ALBERT pre-trained model extracts text features and a BiLSTM captures textual context; for images, DenseNet121 extracts visual features and CBAM highlights the emotion-related regions. Finally, multi-modal cross-attention fuses the extracted text and image features, and the fused output is classified to determine emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, the model outperforms the baseline models, with accuracy and F1 scores reaching 86.5% and 75.3% on MVSA and 85.5% and 76.7% on TumEmo, respectively. Ablation experiments further confirm that multi-modal fusion outperforms single-modal sentiment analysis.
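
The description sketches the model pipeline (ALBERT + BiLSTM for text, DenseNet121 + CBAM for images, cross-attention fusion, then classification). As a rough illustration only, the PyTorch sketch below wires those pieces together in that order; it is reconstructed from the abstract alone, the CBAM block is omitted, and every class name, dimension, and the classifier head (e.g. CrossAttentionFusion, dim=256, three polarity classes) is an assumption rather than the authors' released code.

```python
# Minimal sketch of an MCAM-style text-image fusion model, based only on the
# abstract above. CBAM is omitted; all dimensions, layer choices, and names
# here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AlbertModel
from torchvision.models import densenet121


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=3):
        super().__init__()
        self.text_encoder = AlbertModel.from_pretrained("albert-base-v2")
        self.bilstm = nn.LSTM(768, dim // 2, batch_first=True, bidirectional=True)
        backbone = densenet121(weights=None)  # pretrained weights would normally be loaded
        self.image_encoder = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1))
        self.image_proj = nn.Linear(1024, dim)
        # Text attends to the image representation and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)  # emotional polarity

    def forward(self, input_ids, attention_mask, image):
        # Token-level text features from ALBERT, contextualised by the BiLSTM.
        tokens = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        text, _ = self.bilstm(tokens)                # (B, L, dim)
        # Global image feature from DenseNet121, projected to the shared dimension.
        img = self.image_encoder(image).flatten(1)   # (B, 1024)
        img = self.image_proj(img).unsqueeze(1)      # (B, 1, dim)
        # Cross-attention in both directions, then concatenate and classify.
        t_attended, _ = self.text_to_image(text, img, img)
        i_attended, _ = self.image_to_text(img, text, text)
        fused = torch.cat([t_attended.mean(dim=1), i_attended.squeeze(1)], dim=-1)
        return self.classifier(fused)
```

In the paper itself, CBAM would refine the DenseNet feature maps before pooling and the model would be trained on MVSA and TumEmo labels; those details are not recoverable from this record.
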
DOI: 10.3390/electronics13112069
ISSN: 2079-9292
EISSN: 2079-9292
Source: Publicly Available Content (ProQuest)
Subjects:
Ablation
Classification
Computational linguistics
Data mining
Deep learning
Dictionaries
Emotions
Esports
Image classification
Language processing
Machine learning
Natural language interfaces
Neural networks
Product reviews
Research methodology
Semantics
Sentiment analysis
Social networks