
A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units

Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores segments of natural speech features, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. Howeve...

Bibliographic Details
Published in:IEEE transactions on audio, speech, and language processing, 2011-07, Vol.19 (5), p.1278-1288
Main Authors: Tiomkin, S, Malah, D, Shechtman, S, Kons, Z
Format: Article
Language:English
cited_by cdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3
cites cdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3
container_end_page 1288
container_issue 5
container_start_page 1278
container_title IEEE transactions on audio, speech, and language processing
container_volume 19
creator Tiomkin, S
Malah, D
Shechtman, S
Kons, Z
description Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores segments of natural speech features, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces speech of lower quality than CTTS in terms of naturalness, as it often sounds muffled. The muffling effect is due to over-smoothing of model-generated speech features. To gain the advantages of both approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.
doi_str_mv 10.1109/TASL.2010.2089679
format article
publisher Piscataway, NJ: IEEE
rights 2015 INIST-CNRS
Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Jul 2011
eissn 1558-7924
2329-9304
coden ITASD8
ieee_id 5609194
pqid 915660525
tpages 11
peer_reviewed true
linktohtml https://ieeexplore.ieee.org/document/5609194
backlink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=24286310
fulltext fulltext
identifier ISSN: 1558-7916
ispartof IEEE transactions on audio, speech, and language processing, 2011-07, Vol.19 (5), p.1278-1288
issn 1558-7916
2329-9290
1558-7924
2329-9304
language eng
recordid cdi_pascalfrancis_primary_24286310
source IEEE Electronic Library (IEL) Journals
subjects Applied sciences
Concatenative text-to-speech (CTTS)
Discontinuity
dynamic path
Dynamics
Exact sciences and technology
Footprints
Heuristic algorithms
Hidden Markov models
Hybrid power systems
hybrid TTS
Information, signal and communications theory
Natural languages
Segments
Signal processing
Speech
Speech processing
Speech recognition
statistical TTS
Synthesis
Telecommunications and information theory
TTS synthesis
title A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T01%3A25%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pasca&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Hybrid%20Text-to-Speech%20System%20That%20Combines%20Concatenative%20and%20Statistical%20Synthesis%20Units&rft.jtitle=IEEE%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Tiomkin,%20S&rft.date=2011-07-01&rft.volume=19&rft.issue=5&rft.spage=1278&rft.epage=1288&rft.pages=1278-1288&rft.issn=1558-7916&rft.eissn=1558-7924&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASL.2010.2089679&rft_dat=%3Cproquest_pasca%3E1671420945%3C/proquest_pasca%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c355t-9a0efd5bca75c363df8c1b98adbff5b15962eb1f842e236a05e48e5b96fd7aca3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=915660525&rft_id=info:pmid/&rft_ieee_id=5609194&rfr_iscdi=true