
Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation

Bibliographic Details
Published in: Knowledge-Based Systems, 2022-12, Vol. 258, Article 109978
Main Authors: Zou, ShiHao; Huang, Xianying; Shen, XuDong; Liu, Hankai
Format: Article
Language: English
Description
Summary: Emotion recognition in conversation (ERC) is essential for developing empathic conversation systems. In conversation, emotions can be expressed in multiple modalities, i.e., audio, text, and visual. Due to the inherent characteristics of each modality, it is difficult for a model to use all modalities effectively when fusing modal information. Existing approaches, however, assume that every modality has the same representation ability, which leads to unsatisfactory fusion across modalities. We therefore treat different modalities as having different representation abilities, introduce the concept of the main modal, i.e., the modal with the stronger representation ability after feature extraction, and propose the Main Modal Transformer (MMTr) to improve multimodal fusion. The method preserves the integrity of the main modal features and enhances the representations of the weak modalities by using multi-head attention to learn the information interactions between modalities. In addition, we design a new emotional cue extractor that extracts emotional cues at two levels (the speaker's own context and the surrounding conversational context) to enrich the conversation information obtained by each modal. Extensive experiments on two benchmark datasets validate the effectiveness and superiority of our model.
Highlights:
• Modalities with different representation abilities should be learned differently.
• The modal with the stronger representation ability after feature extraction serves as the main modal.
• Preserve the integrity of the main modal features while enhancing the weak modal features.
• Design an emotional cue extractor to enrich the conversation information.
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2022.109978
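
As an illustration of the fusion idea described in the summary, the sketch below shows one way a main modal (e.g., text) can be kept intact while weaker modalities (audio, visual) are enhanced through multi-head cross-attention. This is a minimal PyTorch sketch under assumed feature shapes; the class and attribute names (MainModalFusion, attn_audio, attn_visual) are hypothetical and this is not the authors' implementation of MMTr.

```python
import torch
import torch.nn as nn


class MainModalFusion(nn.Module):
    """Illustrative sketch of main-modal-guided fusion (not the MMTr code).

    The main modal is passed through unchanged, preserving its features.
    Each weak modal queries the main modal via multi-head attention and is
    enhanced with the attended information before the modalities are fused.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, main: torch.Tensor, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # main, audio, visual: (batch, seq_len, dim) utterance-level features.
        # Weak modalities attend to the main modal; the residual connection
        # keeps their own signal while adding cross-modal information.
        enhanced_a = self.norm_audio(audio + self.attn_audio(audio, main, main)[0])
        enhanced_v = self.norm_visual(visual + self.attn_visual(visual, main, main)[0])
        # The main modal is concatenated as-is, keeping its integrity.
        return torch.cat([main, enhanced_a, enhanced_v], dim=-1)


# Usage example with random features (text assumed to be the main modal).
fusion = MainModalFusion(dim=128)
text = torch.randn(2, 10, 128)    # main modal features
audio = torch.randn(2, 10, 128)   # weak modal features
visual = torch.randn(2, 10, 128)  # weak modal features
fused = fusion(text, audio, visual)  # shape: (2, 10, 384)
```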