Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
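To make the two-stage decomposition described in the abstract concrete, the sketch below illustrates how "reading" (text to semantic tokens) composes with "speaking" (semantic tokens to acoustic tokens, optionally steered by a short voice prompt). This is a minimal illustration only: all names here (read_model, speak_model, codec and their generate/encode/decode methods) are hypothetical stand-ins, not the authors' released code or API.

```python
# Minimal sketch of the two-stage SPEAR-TTS-style pipeline.
# All model/codec objects and their methods are hypothetical placeholders.

from typing import List, Optional, Sequence


def text_to_semantic(text: str, read_model) -> List[int]:
    """'Reading': a seq2seq model maps text to high-level semantic tokens
    (discrete units derived from a self-supervised speech representation)."""
    return read_model.generate(text)


def semantic_to_acoustic(
    semantic_tokens: Sequence[int],
    prompt_tokens: Sequence[int],
    speak_model,
) -> List[int]:
    """'Speaking': a second seq2seq model maps semantic tokens to low-level
    acoustic tokens (e.g. neural-codec codes). Prefixing acoustic tokens from
    a ~3 s example prompt steers the voice toward an unseen speaker without
    any explicit speaker embedding or speaker-id label."""
    return speak_model.generate(list(prompt_tokens) + list(semantic_tokens))


def tts(text: str, read_model, speak_model, codec, prompt_audio=None):
    """Full pipeline: text -> semantic tokens -> acoustic tokens -> waveform."""
    semantic = text_to_semantic(text, read_model)
    prompt: List[int] = codec.encode(prompt_audio) if prompt_audio is not None else []
    acoustic = semantic_to_acoustic(semantic, prompt, speak_model)
    return codec.decode(acoustic)  # waveform samples
```

Because the "speaking" stage consumes only semantic tokens, it can be trained on audio-only data, while the "reading" stage is the only part that needs (a small amount of) parallel text-audio data.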
Published in: | arXiv.org, 2023-02 |
---|---|
Main Authors: | Kharitonov, Eugene; Vincent, Damien; Borsos, Zalán; Marinier, Raphaël; Girgin, Sertan; Pietquin, Olivier; Sharifi, Matt; Tagliasacchi, Marco; Zeghidour, Neil |
Format: | Article |
Language: | English |
Subjects: | Audio data; Decoupling; Representations; Semantics; Speaking; Speech recognition; Training |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | Kharitonov, Eugene; Vincent, Damien; Borsos, Zalán; Marinier, Raphaël; Girgin, Sertan; Pietquin, Olivier; Sharifi, Matt; Tagliasacchi, Marco; Zeghidour, Neil |
description | We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-02 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2774362669 |
source | Access via ProQuest (Open Access) |
subjects | Audio data; Decoupling; Representations; Semantics; Speaking; Speech recognition; Training |
title | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision |