ATDA: Attentional temporal dynamic activation for speech emotion recognition
Published in: Knowledge-Based Systems, 2022-05, Vol. 243, p. 108472, Article 108472
Format: Article
Language: English
Summary: Speech emotion recognition (SER) plays a vital role in intelligent human–computer interaction (HCI). The Convolutional Neural Network (CNN) is widely used in SER, effectively capturing static local features but ignoring the temporal dynamic features necessary for SER. To solve this problem, we place an Attentional Temporal Dynamic Activation (ATDA) module into the CNN-based model to empower it to learn static and dynamic features simultaneously. In particular, the ATDA module comprises a Temporal Dynamic Activation (TDA) block followed by a Multi-view and Multi-granularity Attention (MMA) block. The TDA block calculates the temporal difference at the feature level to activate the dynamic information and generate the fundamental dynamic feature. The MMA block further detects and amplifies the emotion-related dynamic features based on multiple attention views and granularities. These two blocks within the ATDA module cooperate to activate and extract the dynamic emotional features. Meanwhile, the static features are obtained by a convolutional layer and are then combined with the dynamic features to generate the final emotional representations. Finally, experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets reveal that the proposed ATDA-CNN model achieves competitive results and enhances SER accuracy by learning meaningful emotional representations.
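The abstract describes the TDA block only at a high level: it computes a temporal difference at the feature level to expose dynamic information, which is later fused with static features from a convolutional layer. As a rough illustration only, a first-order frame-to-frame difference can be sketched as below; the function names, the zero-padding choice used to preserve sequence length, and the concatenation-based fusion are assumptions for this sketch, not the paper's exact formulation:

```python
import numpy as np

def temporal_difference_activation(features: np.ndarray) -> np.ndarray:
    """Feature-level temporal difference: D[t] = F[t+1] - F[t].

    `features` is a time-major array of shape (T, C). The last frame is
    zero-padded so the output keeps the same shape as the input
    (padding scheme is an assumption of this sketch).
    """
    diff = features[1:] - features[:-1]                      # (T-1, C)
    pad = np.zeros((1, features.shape[1]), dtype=features.dtype)
    return np.concatenate([diff, pad], axis=0)               # (T, C)

def fuse_static_dynamic(static_feat: np.ndarray,
                        dynamic_feat: np.ndarray) -> np.ndarray:
    """Combine static features (e.g., a conv layer's output) with the
    activated dynamic features along the channel axis (fusion by
    concatenation is an assumption of this sketch)."""
    return np.concatenate([static_feat, dynamic_feat], axis=-1)

# Toy example: 4 frames, 2 feature channels.
F = np.array([[0., 0.], [1., 2.], [3., 5.], [6., 9.]])
D = temporal_difference_activation(F)   # [[1,2],[2,3],[3,4],[0,0]]
rep = fuse_static_dynamic(F, D)         # shape (4, 4)
```

In the actual model, the MMA block would then reweight these dynamic features with attention over multiple views and granularities before classification; that step is omitted here because the record does not specify its form.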
ISSN: 0950-7051; 1872-7409
DOI: 10.1016/j.knosys.2022.108472