Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
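
The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical scaffolding for exposition: the function names, token vocabularies, and toy mappings are assumptions, not the authors' models (which are sequence-to-sequence networks over learned discrete speech tokens).

from typing import List

def read(text: str) -> List[int]:
    """'Reading' stage: text -> high-level semantic tokens.
    In SPEAR-TTS this is a seq2seq model whose need for parallel
    data is reduced via pretraining and backtranslation; here it is
    faked with a hypothetical 512-token vocabulary."""
    return [ord(c) % 512 for c in text]

def speak(semantic: List[int], prompt: List[int]) -> List[int]:
    """'Speaking' stage: semantic tokens -> low-level acoustic tokens,
    trainable on audio-only data. Speaker identity is set by example
    prompting: a short acoustic-token prompt (~3 s of speech) is
    prepended so generation continues in that voice."""
    return prompt + [(t * 7 + 3) % 1024 for t in semantic]

def tts(text: str, speaker_prompt: List[int]) -> List[int]:
    """TTS as a composition of the two sequence-to-sequence tasks."""
    return speak(read(text), speaker_prompt)

if __name__ == "__main__":
    prompt = [5, 17, 42]  # stands in for acoustic tokens of a 3 s sample
    print(tts("hello", prompt))

Note how no speaker embedding or speaker-id label appears anywhere: in this scheme, the prompt alone carries the voice, which is what lets the system generalize to unseen speakers.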

Bibliographic Details
Published in: arXiv.org, 2023-02
Main Authors: Kharitonov, Eugene; Vincent, Damien; Borsos, Zalán; Marinier, Raphaël; Girgin, Sertan; Pietquin, Olivier; Sharifi, Matt; Tagliasacchi, Marco; Zeghidour, Neil
Format: Article
Language: English
Subjects: Audio data; Decoupling; Representations; Semantics; Speaking; Speech recognition; Training
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org