Loading…

Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition

Speech emotion recognition (SER), an important method of emotional human–machine interaction, has been the focus of much research in recent years. Motivated by powerful Deep Convolutional Neural Networks (DCNNs) to learn features and the landmark success of these networks in the field of image class...

Full description

Saved in:

Bibliographic Details
Published in:	Circuits, systems, and signal processing systems, and signal processing, 2023, Vol.42 (1), p.449-492
Main Authors:	Falahzadeh, Mohammad Reza, Farokhi, Fardad, Harimi, Ali, Sabbaghi-Nadooshan, Reza
Format:	Article
Language:	English
Subjects:	Algorithms Artificial neural networks Circuits and Systems Color imagery Datasets Electrical Engineering Electronics and Microelectronics Emotion recognition Emotions Engineering Image classification Image enhancement Instrumentation Neural networks Optimization Signal,Image and Speech Processing Speech recognition Tensors
Citations:	Items that this one cites Items that cite this one
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Speech emotion recognition (SER), an important method of emotional human–machine interaction, has been the focus of much research in recent years. Motivated by powerful Deep Convolutional Neural Networks (DCNNs) to learn features and the landmark success of these networks in the field of image classification, the present study aimed to prepare a pre-trained DCNN model for SER and provide compatible input to these networks by converting a speech signal into a 3D tensor. First, using a reconstructed phase space, speech samples are reconstructed in a 3D phase space. Studies have shown that the patterns formed in this space contain meaningful emotional features of the speaker. To provide an input that is compatible with DCNN, a new speech signal representation called Chaogram was introduced as the projection of these patterns, and three channels similar to RGB images were obtained. In the next step, image enhancement techniques were used to highlight the details of Chaogram images. Then, the Visual Geometry Group (VGG) DCNN pre-trained on the large ImageNet dataset is utilized to learn Chaogram high-level features and corresponding emotion classes. Finally, transfer learning is performed on the proposed model, and the presented model is fine-tuned on our datasets. To optimize the hyper-parameter arrangement of architecture-determined CNNs, an innovative DCNN-GWO (gray wolf optimization) is also presented. The results of this study on two public datasets of emotions, i.e., EMO-DB and eNTERFACE05, show the promising performance of the proposed model, which can greatly improve SER applications.
ISSN:	0278-081X 1531-5878
DOI:	10.1007/s00034-022-02130-3