ATDA: Attentional temporal dynamic activation for speech emotion recognition

Bibliographic Details
Published in: Knowledge-Based Systems, 2022-05, Vol. 243, p. 108472, Article 108472
Main Authors: Liu, Lu-Yao, Liu, Wen-Zhe, Zhou, Jian, Deng, Hui-Yuan, Feng, Lin
Format: Article
Language:English
Description
Summary: Speech emotion recognition (SER) plays a vital role in intelligent human–computer interaction (HCI). The Convolutional Neural Network (CNN) is widely used in SER, effectively capturing static local features but ignoring the temporal dynamic features necessary for SER. To solve this problem, we place an Attentional Temporal Dynamic Activation (ATDA) module into a CNN-based model to empower it to learn static and dynamic features simultaneously. In particular, the ATDA module comprises a Temporal Dynamic Activation (TDA) block followed by a Multi-view and Multi-granularity Attention (MMA) block. The TDA block calculates the temporal difference at the feature level to activate the dynamic information and generate the fundamental dynamic feature. The MMA block further detects and amplifies the emotion-related dynamic features based on multiple attention views and granularities. These two blocks within the ATDA module cooperate to activate and extract the dynamic emotional features. Meanwhile, the static features are obtained by a convolutional layer and then combined with the dynamic features to generate the final emotional representations. Finally, experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets reveal that the proposed ATDA-CNN model achieves competitive results and enhances SER accuracy by learning meaningful emotional representations.
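The summary states that the TDA block activates dynamic information by computing a temporal difference at the feature level. A minimal sketch of such a first-order frame-to-frame difference, assuming a (T, D) matrix of frame-level features; the function name and the zero-padding of the final frame are illustrative choices, not details taken from the paper:

```python
import numpy as np

def temporal_difference(features: np.ndarray) -> np.ndarray:
    """Feature-level temporal difference: d_t = x_{t+1} - x_t.

    features: (T, D) array of frame-level features.
    Returns a (T, D) array; the last frame is zero-padded
    so the output keeps the input's temporal length.
    """
    diff = features[1:] - features[:-1]           # (T-1, D) forward differences
    pad = np.zeros((1, features.shape[1]))        # keep length T
    return np.concatenate([diff, pad], axis=0)
```

In the paper's pipeline these differenced features would then be weighted by attention and fused with static convolutional features; the sketch only shows the differencing step itself.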
ISSN: 0950-7051
1872-7409
DOI: 10.1016/j.knosys.2022.108472