Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
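To make the two-stage decomposition described in the abstract concrete, the sketch below illustrates how "reading" (text to semantic tokens) composes with "speaking" (semantic tokens to acoustic tokens, optionally steered by a short voice prompt). This is a minimal illustration only: all names here (read_model, speak_model, codec and their generate/encode/decode methods) are hypothetical stand-ins, not the authors' released code or API.

```python
# Minimal sketch of the two-stage SPEAR-TTS-style pipeline.
# All model/codec objects and their methods are hypothetical placeholders.

from typing import List, Optional, Sequence


def text_to_semantic(text: str, read_model) -> List[int]:
    """'Reading': a seq2seq model maps text to high-level semantic tokens
    (discrete units derived from a self-supervised speech representation)."""
    return read_model.generate(text)


def semantic_to_acoustic(
    semantic_tokens: Sequence[int],
    prompt_tokens: Sequence[int],
    speak_model,
) -> List[int]:
    """'Speaking': a second seq2seq model maps semantic tokens to low-level
    acoustic tokens (e.g. neural-codec codes). Prefixing acoustic tokens from
    a ~3 s example prompt steers the voice toward an unseen speaker without
    any explicit speaker embedding or speaker-id label."""
    return speak_model.generate(list(prompt_tokens) + list(semantic_tokens))


def tts(text: str, read_model, speak_model, codec, prompt_audio=None):
    """Full pipeline: text -> semantic tokens -> acoustic tokens -> waveform."""
    semantic = text_to_semantic(text, read_model)
    prompt: List[int] = codec.encode(prompt_audio) if prompt_audio is not None else []
    acoustic = semantic_to_acoustic(semantic, prompt, speak_model)
    return codec.decode(acoustic)  # waveform samples
```

Because the "speaking" stage consumes only semantic tokens, it can be trained on audio-only data, while the "reading" stage is the only part that needs (a small amount of) parallel text-audio data.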
Published in: | arXiv.org, 2023-02 |
---|---|
Main Authors: | Kharitonov, Eugene; Vincent, Damien; Borsos, Zalán; Marinier, Raphaël; Girgin, Sertan; Pietquin, Olivier; Sharifi, Matt; Tagliasacchi, Marco; Zeghidour, Neil |
Format: | Article |
Language: | English |
Subjects: | Audio data; Decoupling; Representations; Semantics; Speaking; Speech recognition; Training |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | Kharitonov, Eugene; Vincent, Damien; Borsos, Zalán; Marinier, Raphaël; Girgin, Sertan; Pietquin, Olivier; Sharifi, Matt; Tagliasacchi, Marco; Zeghidour, Neil |
description | We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-02 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2774362669 |
source | Access via ProQuest (Open Access) |
subjects | Audio data; Decoupling; Representations; Semantics; Speaking; Speech recognition; Training |
title | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision |