Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism
Research on uni-modal sentiment analysis has achieved great success, but emotion in real life is mostly multi-modal: it appears not only as text but also as images, audio, video, and other forms, and the different modalities reinforce one another. If the connections between modalities can be mined, the accuracy of sentiment analysis can be improved further. To this end, this paper introduces MCAM, a cross-attention-based multi-modal fusion model for images and text. For text, the ALBERT pre-trained model extracts token features and a BiLSTM captures textual context; for images, DenseNet121 extracts visual features and CBAM highlights the emotion-relevant regions. Finally, multi-modal cross-attention fuses the extracted text and image features, and the fused representation is classified to determine emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, the proposed model outperforms the baseline models, with accuracy and F1 scores reaching 86.5% and 75.3% on MVSA and 85.5% and 76.7% on TumEmo, respectively. Ablation experiments further confirm that multi-modal fusion outperforms single-modal sentiment analysis.

Published in: Electronics (Basel), 2024-06, Vol. 13 (11), p. 2069
Main Authors: Li, Hongchan; Lu, Yantong; Zhu, Haodong
Format: Article
Language: English
DOI: 10.3390/electronics13112069
ISSN/EISSN: 2079-9292
Publisher: MDPI AG, Basel
Rights: © 2024 by the authors; open access under the Creative Commons Attribution (CC BY) license
Source: Publicly Available Content (ProQuest)
Subjects: Ablation; Classification; Computational linguistics; Data mining; Deep learning; Dictionaries; Emotions; Esports; Image classification; Language processing; Machine learning; Natural language interfaces; Neural networks; Product reviews; Research methodology; Semantics; Sentiment analysis; Social networks
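The pipeline described in the abstract (a text encoder plus BiLSTM, a CNN image encoder plus attention over regions, then cross-attention fusion and a polarity classifier) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical reconstruction of the fusion stage only, not the authors' released MCAM implementation: the hidden sizes, the four attention heads, the three-class polarity output, and the placeholder tensors standing in for ALBERT and DenseNet121 outputs are all assumptions.

```python
# Illustrative sketch only: module names and hyper-parameters are assumptions,
# not the paper's released implementation.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse token-level text features and region-level image features
    with two cross-attention blocks, one per query direction."""

    def __init__(self, text_dim=768, image_dim=1024, hidden=256, num_classes=3):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        # BiLSTM adds sequential context to the text tokens (output dim = 2 * hidden/2 = hidden).
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        # Text queries attend over image regions, and vice versa.
        self.text_to_image = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, text_dim)  e.g. token embeddings from a text encoder
        # image_feats: (B, R, image_dim) e.g. flattened CNN feature-map regions
        t = self.text_proj(text_feats)
        v = self.image_proj(image_feats)
        t, _ = self.bilstm(t)
        # Cross-attention in both directions, then mean-pool each stream.
        t2v, _ = self.text_to_image(query=t, key=v, value=v)
        v2t, _ = self.image_to_text(query=v, key=t, value=t)
        fused = torch.cat([t2v.mean(dim=1), v2t.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # (B, num_classes) polarity logits


if __name__ == "__main__":
    # Stand-in features: 12 text tokens of width 768 (ALBERT-base) and a 7x7
    # DenseNet121 feature map flattened to 49 regions of 1024 channels.
    text = torch.randn(2, 12, 768)
    image = torch.randn(2, 49, 1024)
    print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 3])
```

Attending in both directions, with text queries over image regions and image queries over text tokens, is one common way to let each modality select the parts of the other that matter for sentiment; the paper's exact fusion details may differ.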