Discriminative feature learning based on multi-view attention network with diffusion joint loss for speech emotion recognition

Bibliographic Details
Published in: Engineering Applications of Artificial Intelligence, 2024-11, Vol. 137, p. 109219, Article 109219
Main Authors: Liu, Yang; Chen, Xin; Song, Yuan; Li, Yarong; Wang, Shengbei; Yuan, Weitao; Li, Yongwei; Zhao, Zhen
Format: Article
Language: English
Description
Summary: In speech emotion recognition, existing models often struggle to classify highly similar emotions accurately. In this paper, we propose a novel architecture that integrates a multi-view attention network (MVAN) with a diffusion joint loss to reduce this confusion by focusing more strongly on emotions that are hard to classify. First, we use logarithmic Mel-spectrograms (log-Mels) together with their deltas and delta-deltas as three-dimensional features to minimize external interference. Then, we design the MVAN to extract effective multi-time-scale emotion features: channel and spatial attention selectively localize the regions of the input features related to the target emotion, a multi-time-view bidirectional long short-term memory (BiLSTM) network extracts shallow edge features and deep semantic features, and multi-scale self-attention fuses these features through cross-scale attention fusion to obtain multi-time-scale emotion features. Finally, a diffusion joint loss strategy is introduced to separate highly similar emotional embeddings using complex emotion triplets generated in a diffusing fashion. We evaluated the proposed method on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, the Institute of Automation, Chinese Academy of Sciences (CASIA) corpus, and the Berlin Database of Emotional Speech (EMODB). The results show significant improvements over existing methods, achieving 86.87% weighted accuracy (WA), 86.60% unweighted accuracy (UA), and 86.82% weighted F1 score (WF1) on IEMOCAP; 70.74% WA, 70.74% UA, and 70.25% WF1 on CASIA; and 93.65% WA, 91.13% UA, and 92.26% WF1 on EMODB. These results confirm the superiority of our method. Our code and model are available at https://github.com/Littleznnz/MVAN-DiffSEG.
ISSN: 0952-1976
DOI: 10.1016/j.engappai.2024.109219
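
The summary describes stacking log-Mels with their deltas and delta-deltas into a three-dimensional input. The following is a minimal sketch of that stacking step only, not the authors' implementation. It assumes a log-Mel matrix is already available (a random array stands in for it here), and the function names `delta` and `three_channel_features`, the regression window `width=2`, and the 40 x 300 shape are illustrative assumptions that do not come from the record.

```python
import numpy as np


def delta(feat: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis (the last axis).

    Uses d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..width.
    The window size is an assumption, not a value taken from the paper.
    """
    n_frames = feat.shape[-1]
    # Repeat the edge frames so every frame has `width` neighbours on each side.
    pad = [(0, 0)] * (feat.ndim - 1) + [(width, width)]
    padded = np.pad(feat, pad, mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros(feat.shape, dtype=np.float64)
    for n in range(1, width + 1):
        ahead = padded[..., width + n : width + n + n_frames]
        behind = padded[..., width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


def three_channel_features(log_mel: np.ndarray) -> np.ndarray:
    """Stack log-Mel, delta, and delta-delta as channels: (3, n_mels, n_frames)."""
    d1 = delta(log_mel)
    d2 = delta(d1)  # delta-delta is the delta of the delta
    return np.stack([log_mel, d1, d2], axis=0)


# Synthetic stand-in for a real log-Mel spectrogram (40 Mel bands, 300 frames).
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 300))
features = three_channel_features(log_mel)
print(features.shape)  # (3, 40, 300)
```

The channel-first layout matches what a convolutional front end with channel and spatial attention would typically consume, but the layout the authors actually use should be checked against their repository.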