A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units
Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech feature segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. Howeve...
Published in: | IEEE transactions on audio, speech, and language processing, 2011-07, Vol.19 (5), p.1278-1288 |
---|---|
Main Authors: | Tiomkin, S ; Malah, D ; Shechtman, S ; Kons, Z |
Format: | Article |
Language: | English |
Subjects: | |
|
cited_by | cdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3 |
---|---|
cites | cdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3 |
container_end_page | 1288 |
container_issue | 5 |
container_start_page | 1278 |
container_title | IEEE transactions on audio, speech, and language processing |
container_volume | 19 |
creator | Tiomkin, S Malah, D Shechtman, S Kons, Z |
description | Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech feature segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower quality speech than CTTS in terms of naturalness, as it often sounds muffled. The muffling effect is due to over-smoothing of model-generated speech features. In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interwoven fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach. |
doi_str_mv | 10.1109/TASL.2010.2089679 |
format | article |
fullrecord | <record><control><sourceid>proquest_pasca</sourceid><recordid>TN_cdi_pascalfrancis_primary_24286310</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5609194</ieee_id><sourcerecordid>1671420945</sourcerecordid><originalsourceid>FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3</originalsourceid><addsrcrecordid>eNpdkEFLwzAYhosoOKc_QLwUQfDSmaRJ2hzHUCcMPLS7eAlp-oVlbOlsMnH_3pSNHTx970ee9yM8SXKP0QRjJF7qabWYEBRXgkrBC3GRjDBjZVYIQi_PGfPr5Mb7NUI05xSPkq9pOj80vW3TGn5DFrqs2gHoVVodfIBtWq9USGfdtrEOfAxOqwBOBfsDqXJtWoWYfbBabWLFhRV469Ols8HfJldGbTzcneY4Wb691rN5tvh8_5hNF5nOGQuZUAhMyxqtCqZznrem1LgRpWobY1iDmeAEGmxKSoDkXCEGtATWCG7aQmmVj5Pn491d333vwQe5tV7DZqMcdHsvMS8wJUhQFtHHf-i62_cu_k4KzDhHjAwQPkK677zvwchdb7eqP0iM5CBbDrLlIFueZMfO0-mw8lGF6ZXT1p-LhJKS5xhF7uHIWQA4PzOOBBY0_wPSDIiJ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>915660525</pqid></control><display><type>article</type><title>A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units</title><source>IEEE Electronic Library (IEL) Journals</source><creator>Tiomkin, S ; Malah, D ; Shechtman, S ; Kons, Z</creator><creatorcontrib>Tiomkin, S ; Malah, D ; Shechtman, S ; Kons, Z</creatorcontrib><description>Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech features segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. 
Yet, in general, STTS produces lower quality speech than CTTS, in terms of naturalness, as it is often sounding muffled. The muffling effect is due to over-smoothing of model-generated speech features. In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.</description><identifier>ISSN: 1558-7916</identifier><identifier>ISSN: 2329-9290</identifier><identifier>EISSN: 1558-7924</identifier><identifier>EISSN: 2329-9304</identifier><identifier>DOI: 10.1109/TASL.2010.2089679</identifier><identifier>CODEN: ITASD8</identifier><language>eng</language><publisher>Piscataway, NJ: IEEE</publisher><subject>Applied sciences ; Concatenative text-to-speech (CTTS) ; Discontinuity ; dynamic path ; Dynamics ; Exact sciences and technology ; Footprints ; Heuristic algorithms ; Hidden Markov models ; Hybrid power systems ; hybrid TTS ; Information, signal and communications theory ; Natural languages ; Segments ; Signal processing ; Speech ; Speech processing ; Speech recognition ; statistical TTS ; Synthesis ; Telecommunications and information theory ; TTS synthesis</subject><ispartof>IEEE transactions on audio, speech, and language processing, 2011-07, Vol.19 (5), p.1278-1288</ispartof><rights>2015 INIST-CNRS</rights><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) Jul 2011</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3</citedby><cites>FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5609194$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,54796</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=24286310$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>Tiomkin, S</creatorcontrib><creatorcontrib>Malah, D</creatorcontrib><creatorcontrib>Shechtman, S</creatorcontrib><creatorcontrib>Kons, Z</creatorcontrib><title>A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units</title><title>IEEE transactions on audio, speech, and language processing</title><addtitle>TASL</addtitle><description>Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech features segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower quality speech than CTTS, in terms of naturalness, as it is often sounding muffled. The muffling effect is due to over-smoothing of model-generated speech features. 
In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.</description><subject>Applied sciences</subject><subject>Concatenative text-to-speech (CTTS)</subject><subject>Discontinuity</subject><subject>dynamic path</subject><subject>Dynamics</subject><subject>Exact sciences and technology</subject><subject>Footprints</subject><subject>Heuristic algorithms</subject><subject>Hidden Markov models</subject><subject>Hybrid power systems</subject><subject>hybrid TTS</subject><subject>Information, signal and communications theory</subject><subject>Natural languages</subject><subject>Segments</subject><subject>Signal processing</subject><subject>Speech</subject><subject>Speech processing</subject><subject>Speech recognition</subject><subject>statistical TTS</subject><subject>Synthesis</subject><subject>Telecommunications and information theory</subject><subject>TTS synthesis</subject><issn>1558-7916</issn><issn>2329-9290</issn><issn>1558-7924</issn><issn>2329-9304</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><recordid>eNpdkEFLwzAYhosoOKc_QLwUQfDSmaRJ2hzHUCcMPLS7eAlp-oVlbOlsMnH_3pSNHTx970ee9yM8SXKP0QRjJF7qabWYEBRXgkrBC3GRjDBjZVYIQi_PGfPr5Mb7NUI05xSPkq9pOj80vW3TGn5DFrqs2gHoVVodfIBtWq9USGfdtrEOfAxOqwBOBfsDqXJtWoWYfbBabWLFhRV469Ols8HfJldGbTzcneY4Wb691rN5tvh8_5hNF5nOGQuZUAhMyxqtCqZznrem1LgRpWobY1iDmeAEGmxKSoDkXCEGtATWCG7aQmmVj5Pn491d333vwQe5tV7DZqMcdHsvMS8wJUhQFtHHf-i62_cu_k4KzDhHjAwQPkK677zvwchdb7eqP0iM5CBbDrLlIFueZMfO0-mw8lGF6ZXT1p-LhJKS5xhF7uHIWQA4PzOOBBY0_wPSDIiJ</recordid><startdate>20110701</startdate><enddate>20110701</enddate><creator>Tiomkin, 
S</creator><creator>Malah, D</creator><creator>Shechtman, S</creator><creator>Kons, Z</creator><general>IEEE</general><general>Institute of Electrical and Electronics Engineers</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20110701</creationdate><title>A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units</title><author>Tiomkin, S ; Malah, D ; Shechtman, S ; Kons, Z</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Applied sciences</topic><topic>Concatenative text-to-speech (CTTS)</topic><topic>Discontinuity</topic><topic>dynamic path</topic><topic>Dynamics</topic><topic>Exact sciences and technology</topic><topic>Footprints</topic><topic>Heuristic algorithms</topic><topic>Hidden Markov models</topic><topic>Hybrid power systems</topic><topic>hybrid TTS</topic><topic>Information, signal and communications theory</topic><topic>Natural languages</topic><topic>Segments</topic><topic>Signal processing</topic><topic>Speech</topic><topic>Speech processing</topic><topic>Speech recognition</topic><topic>statistical TTS</topic><topic>Synthesis</topic><topic>Telecommunications and information theory</topic><topic>TTS synthesis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Tiomkin, S</creatorcontrib><creatorcontrib>Malah, D</creatorcontrib><creatorcontrib>Shechtman, S</creatorcontrib><creatorcontrib>Kons, Z</creatorcontrib><collection>IEEE All-Society Periodicals 
Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library</collection><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on audio, speech, and language processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tiomkin, S</au><au>Malah, D</au><au>Shechtman, S</au><au>Kons, Z</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units</atitle><jtitle>IEEE transactions on audio, speech, and language processing</jtitle><stitle>TASL</stitle><date>2011-07-01</date><risdate>2011</risdate><volume>19</volume><issue>5</issue><spage>1278</spage><epage>1288</epage><pages>1278-1288</pages><issn>1558-7916</issn><issn>2329-9290</issn><eissn>1558-7924</eissn><eissn>2329-9304</eissn><coden>ITASD8</coden><abstract>Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech features segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. 
On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower quality speech than CTTS, in terms of naturalness, as it is often sounding muffled. The muffling effect is due to over-smoothing of model-generated speech features. In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.</abstract><cop>Piscataway, NJ</cop><pub>IEEE</pub><doi>10.1109/TASL.2010.2089679</doi><tpages>11</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1558-7916 |
ispartof | IEEE transactions on audio, speech, and language processing, 2011-07, Vol.19 (5), p.1278-1288 |
issn | 1558-7916 2329-9290 1558-7924 2329-9304 |
language | eng |
recordid | cdi_pascalfrancis_primary_24286310 |
source | IEEE Electronic Library (IEL) Journals |
subjects | Applied sciences Concatenative text-to-speech (CTTS) Discontinuity dynamic path Dynamics Exact sciences and technology Footprints Heuristic algorithms Hidden Markov models Hybrid power systems hybrid TTS Information, signal and communications theory Natural languages Segments Signal processing Speech Speech processing Speech recognition statistical TTS Synthesis Telecommunications and information theory TTS synthesis |
title | A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T01%3A25%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pasca&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Hybrid%20Text-to-Speech%20System%20That%20Combines%20Concatenative%20and%20Statistical%20Synthesis%20Units&rft.jtitle=IEEE%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Tiomkin,%20S&rft.date=2011-07-01&rft.volume=19&rft.issue=5&rft.spage=1278&rft.epage=1288&rft.pages=1278-1288&rft.issn=1558-7916&rft.eissn=1558-7924&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASL.2010.2089679&rft_dat=%3Cproquest_pasca%3E1671420945%3C/proquest_pasca%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=915660525&rft_id=info:pmid/&rft_ieee_id=5609194&rfr_iscdi=true |
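
The description field above says that each utterance is built from natural and model-generated segments chosen by a "hybrid dynamic path algorithm". This record does not give the algorithm or its cost functions, so the following is only a generic illustration of the idea, not the authors' method: a small dynamic program that picks, for each segment, either a natural or a model-generated unit. The function name `hybrid_path`, the per-segment cost lists, and the single `switch_penalty` are all hypothetical.

```python
from typing import List, Sequence, Tuple

NATURAL, MODEL = 0, 1  # hypothetical labels for the two unit types


def hybrid_path(
    natural_cost: Sequence[float],
    model_cost: Sequence[float],
    switch_penalty: float,
) -> Tuple[float, List[int]]:
    """Choose NATURAL or MODEL per segment, minimizing total cost.

    Assumed cost model (not taken from the paper): each segment has one
    cost per unit type, and changing type between neighbouring segments
    adds a fixed switch_penalty.
    """
    n = len(natural_cost)
    if n == 0:
        return 0.0, []
    costs = list(zip(natural_cost, model_cost))
    best = [costs[0][NATURAL], costs[0][MODEL]]  # best cost ending in each type
    back: List[List[int]] = []                   # back-pointers per segment
    for t in range(1, n):
        new_best = [0.0, 0.0]
        pointers = [0, 0]
        for cur in (NATURAL, MODEL):
            stay = best[cur]
            switch = best[1 - cur] + switch_penalty
            prev = cur if stay <= switch else 1 - cur
            new_best[cur] = min(stay, switch) + costs[t][cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the final segment.
    state = NATURAL if best[NATURAL] <= best[MODEL] else MODEL
    total = best[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return total, path


# A poor natural match in the middle segment is replaced by a model unit.
total, path = hybrid_path([1.0, 5.0, 1.0], [2.0, 2.0, 2.0], switch_penalty=0.5)
print(total, path)
```

In this toy setting a small switch penalty lets the path alternate between unit types, while a large one pushes it toward a single type for the whole utterance; how the paper actually scores joins between natural and model-generated segments is not stated in this record.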