ATDA: Attentional temporal dynamic activation for speech emotion recognition

Bibliographic Details
Published in: Knowledge-Based Systems, 2022-05, Vol. 243, p. 108472, Article 108472
Main Authors: Liu, Lu-Yao, Liu, Wen-Zhe, Zhou, Jian, Deng, Hui-Yuan, Feng, Lin
Format: Article
Language:English
Description
Summary: Speech emotion recognition (SER) plays a vital role in intelligent human–computer interaction (HCI). The Convolutional Neural Network (CNN) is widely used in SER, effectively capturing static local features but ignoring the temporal dynamic features necessary for SER. To solve this problem, we place an Attentional Temporal Dynamic Activation (ATDA) module into a CNN-based model to empower it to learn static and dynamic features simultaneously. In particular, the ATDA module comprises a Temporal Dynamic Activation (TDA) block followed by a Multi-view and Multi-granularity Attention (MMA) block. The TDA block calculates the temporal difference at the feature level to activate the dynamic information and generate the fundamental dynamic feature. The MMA block further detects and amplifies the emotion-related dynamic features based on multiple attention views and granularities. These two blocks within the ATDA module cooperate to activate and extract the dynamic emotional features. Meanwhile, the static features are obtained by a convolutional layer and then combined with the dynamic features to generate the final emotional representations. Finally, experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets reveal that the proposed ATDA-CNN model achieves competitive results and enhances SER accuracy by learning meaningful emotional representations.
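The summary states that the TDA block activates dynamic information by computing a temporal difference at the feature level. A minimal sketch of such a first-order frame-to-frame difference, assuming a (T, D) matrix of frame-level features; the function name and the zero-padding of the final frame are illustrative choices, not details taken from the paper:

```python
import numpy as np

def temporal_difference(features: np.ndarray) -> np.ndarray:
    """Feature-level temporal difference: d_t = x_{t+1} - x_t.

    features: (T, D) array of frame-level features.
    Returns a (T, D) array; the last frame is zero-padded
    so the output keeps the input's temporal length.
    """
    diff = features[1:] - features[:-1]           # (T-1, D) forward differences
    pad = np.zeros((1, features.shape[1]))        # keep length T
    return np.concatenate([diff, pad], axis=0)
```

In the paper's pipeline these differenced features would then be weighted by attention and fused with static convolutional features; the sketch only shows the differencing step itself.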
ISSN: 0950-7051
1872-7409
DOI: 10.1016/j.knosys.2022.108472