SENet-based speech emotion recognition using synthesis-style transfer data augmentation

This paper addresses speech emotion recognition using a channel-attention mechanism combined with a synthesized-data augmentation approach. A convolutional neural network (CNN) produces a channel-attention map by exploiting the inter-channel relationships of its features. The main obstacle in the speech emotion recognition domain is insufficient data for building an efficient model. The proposed work uses a style-transfer scheme to achieve data augmentation through multi-voice synthesis from text, consisting of a text-to-speech (TTS) module and a style-transfer module. At the front end, a TTS converter generates synthesized speech from text in a target speaker's voice; the style-transfer module then imposes an emotion on the synthesized speech based on the emotional content fed to it. The text-to-speech module is trained on the LibriSpeech and NUS-48E corpora. The quality of the synthesized speech samples is also rated by subjective evaluation using the mean opinion score (MOS). The speech emotion recognition approach is systematically evaluated on the Berlin EMO-DB corpus, where the channel-attention-based squeeze-and-excitation network (SENet) shows its promise.
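The channel attention the abstract refers to is the squeeze-and-excitation (SE) block that SENet is built from: a global average pool summarizes each channel, and a small bottleneck network turns that summary into a per-channel gate. The PyTorch sketch below illustrates the generic SE block only, not the authors' architecture; the reduction ratio of 16 is the common default from the SENet literature, not a value reported in this paper.

```python
# Minimal squeeze-and-excitation (SE) block: the channel-attention
# mechanism underlying SENet. A generic sketch, not the model from
# the paper; reduction=16 is an assumed default.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pool per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)        # "squeeze": one scalar per channel
        w = self.excite(w).view(b, c, 1, 1)   # "excitation": channel-attention map
        return x * w                          # reweight feature maps channel-wise

# Hypothetical usage on CNN features of a log-mel spectrogram:
# feats = torch.randn(8, 64, 40, 100)   # (batch, channels, mel bins, frames)
# out = SEBlock(64)(feats)              # same shape, channels reweighted
```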

Bibliographic Details
Published in: International Journal of Speech Technology, 2023-12, Vol. 26 (4), p. 1017-1030
Main Authors: Rajan, Rajeev; Hridya Raj, T. V.
Format: Article
Language: English
DOI: 10.1007/s10772-023-10071-8
ISSN: 1381-2416
EISSN: 1572-8110
Source: Springer Nature; Linguistics and Language Behavior Abstracts (LLBA)
Subjects:
subjects Artificial Intelligence
Artificial neural networks
Attention
Corpus linguistics
Data augmentation
Emotion recognition
Emotions
Engineering
Mass media
Modules
Recognition
Signal, Image and Speech Processing
Social Sciences
Speech
Speech recognition
Speech synthesis
Synthesis
Text-to-speech