Speech emotion recognition based on bi-directional acoustic–articulatory conversion
Published in: Knowledge-Based Systems, 2024-09, Vol. 299, Article 112123
Main Authors:
Format: Article
Language: English
Summary: Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic–articulatory conversion have led previous studies on speech emotion recognition (SER) to rely primarily on unidirectional acoustic–articulatory conversion, ignoring the potential benefits of bi-directional conversion. How to address this nonlinear ill-posedness and how to effectively extract and exploit features from both modalities for SER remain open research questions. To bridge this gap, this study proposes Bi-A2CEmo, a framework that performs bi-directional acoustic–articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. A further obstacle is the absence of a parallel acoustic–articulatory emotion database. To overcome it, this study uses electromagnetic articulography (EMA) to build STEM-E2VA, a multi-modal acoustic–articulatory emotion database for Mandarin Chinese. A comparative analysis against state-of-the-art models is then conducted to evaluate the framework. Bi-A2CEmo achieves an SER accuracy of 89.04%, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. Results on STEM-E2VA show that Bi-MGAN achieves higher mapping and inversion accuracy than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces intra-class spacing while increasing inter-class spacing. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic–articulatory conversion framework can significantly improve SER performance.
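The abstract describes a bi-directional acoustic–articulatory conversion module (Bi-MGAN) that must cope with an ill-posed inverse mapping. As a rough illustration only, the sketch below shows a generic pair of conversion networks trained with direct regression and cycle-consistency losses on parallel data. The feature dimensions, layer sizes, loss weights, and the use of plain per-frame MLP regressors are assumptions made for this sketch; they are not taken from the paper, which does not disclose implementation details in this record.

```python
# Illustrative sketch of bi-directional acoustic <-> articulatory conversion
# with cycle-consistency terms. All dimensions and weights are assumptions.
import torch
import torch.nn as nn

ACOUSTIC_DIM = 40   # e.g., per-frame spectral features (assumed for the sketch)
ARTIC_DIM = 12      # e.g., EMA sensor coordinates per frame (assumed)

def mlp(in_dim, out_dim, hidden=128):
    """Simple per-frame regressor used for both conversion directions."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

a2m = mlp(ACOUSTIC_DIM, ARTIC_DIM)   # acoustic -> articulatory (mapping)
m2a = mlp(ARTIC_DIM, ACOUSTIC_DIM)   # articulatory -> acoustic (inversion)
opt = torch.optim.Adam(list(a2m.parameters()) + list(m2a.parameters()), lr=1e-3)
mse = nn.MSELoss()

def training_step(acoustic, articulatory, lambda_cycle=1.0):
    """One joint update: direct conversion losses plus cycle-consistency terms
    that constrain the otherwise ill-posed inverse direction."""
    pred_artic = a2m(acoustic)
    pred_acou = m2a(articulatory)
    # Direct supervised losses; this requires parallel data such as STEM-E2VA.
    loss = mse(pred_artic, articulatory) + mse(pred_acou, acoustic)
    # Cycle terms: converting forth and back should reconstruct the input.
    loss = loss + lambda_cycle * (mse(m2a(pred_artic), acoustic)
                                  + mse(a2m(pred_acou), articulatory))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch of 8 frames to show the call; real features would come from a corpus.
loss = training_step(torch.randn(8, ACOUSTIC_DIM), torch.randn(8, ARTIC_DIM))
print(f"joint conversion loss: {loss:.4f}")
```

The cycle-consistency terms stand in here for whatever regularization the actual Bi-MGAN uses to constrain the inverse mapping; adversarial losses or emotion-aware constraints (as in KCLNet) would be added on top in a fuller system.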
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2024.112123