
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis

The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner, with plenty of audio data that covers complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model the pronunciation and high-level style expressiveness respectively, with the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis especially for the role and out-of-domain scenarios.
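
The abstract describes a spectrogram style extractor built on VQ-VAE and pre-trained in a self-supervised manner. As a purely illustrative sketch of the general VQ-VAE technique (not the authors' implementation; the framework choice, layer names, and hyper-parameters such as `num_codes`, `code_dim`, and `beta` are assumptions), a vector-quantization bottleneck of the kind such an extractor relies on could look like this in PyTorch:

```python
# Generic VQ-VAE vector-quantization bottleneck (illustrative sketch only;
# not the StyleSpeech code). All sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Maps continuous style embeddings to the nearest codebook entry."""

    def __init__(self, num_codes: int = 256, code_dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.size(-1))                      # (B*T, D)
        # Squared L2 distance from each frame to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))             # (B*T, K)
        indices = dist.argmin(dim=1)                              # nearest code per frame
        z_q = self.codebook(indices).view_as(z_e)                 # quantized output

        # Codebook and commitment terms of the VQ-VAE objective
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commit_loss = F.mse_loss(z_e, z_q.detach())
        loss = codebook_loss + self.beta * commit_loss

        # Straight-through estimator: gradients bypass the argmin
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

In a self-supervised pre-training setup like the one the abstract sketches, the continuous frame-level outputs of a style encoder would pass through such a quantizer, a decoder would reconstruct the spectrogram from the quantized codes (its reconstruction loss added to `loss`), and the resulting discrete indices would serve as compact style tokens guiding the synthesis model.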


Bibliographic Details
Main Authors: Chen, Xueyuan, Wang, Xi, Zhang, Shaofei, He, Lei, Wu, Zhiyong, Wu, Xixin, Meng, Helen
Format: Conference Proceeding
Language: English
Subjects: Acoustics; Data mining; Data models; expressive speech synthesis; pre-training; self-supervised style enhancing; Spectrogram; Speech enhancement; Speech synthesis; Training data; VQ-VAE
Online Access: Request full text
cited_by
cites
container_end_page 12320
container_issue
container_start_page 12316
container_title
container_volume
creator Chen, Xueyuan
Wang, Xi
Zhang, Shaofei
He, Lei
Wu, Zhiyong
Wu, Xixin
Meng, Helen
description The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner, with plenty of audio data that covers complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model the pronunciation and high-level style expressiveness respectively, with the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis especially for the role and out-of-domain scenarios.
doi_str_mv 10.1109/ICASSP48485.2024.10446352
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2379-190X
ispartof ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.12316-12320
issn 2379-190X
language eng
recordid cdi_ieee_primary_10446352
source IEEE Xplore All Conference Series
subjects Acoustics
Data mining
Data models
expressive speech synthesis
pre-training
self-supervised style enhancing
Spectrogram
Speech enhancement
Speech synthesis
Training data
VQ-VAE
title Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis