Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression
In previous work we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesizer. In this method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms are used to model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous "explanatory" variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM, instead of for the context-dependent states in the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between state context features and articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database with acoustic waveforms and articulatory movements recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance when reconstructing target vowels by varying articulatory inputs than our previous approach. A second vowel creation task shows our new method is highly effective at producing a new vowel from appropriate articulatory representations which, even though no acoustic samples for this vowel are present in the training data, is shown to sound highly natural.
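The abstract's core mechanism, regression matrices switched by a GMM over articulatory space rather than tied to HMM states, can be illustrated with a small sketch. Below is a minimal Python rendering of one plausible reading of that scheme: the articulatory vector is augmented with a bias term, the articulatory-space GMM yields component posteriors, and the acoustic state mean is shifted by the posterior-weighted sum of per-component regression predictions. All names, dimensions, and the soft (posterior-weighted) switching are illustrative assumptions, not the paper's exact notation or formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posteriors(x, weights, means, covs):
    """Posterior probability of each articulatory-space GMM component given x."""
    likelihoods = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    return likelihoods / likelihoods.sum()

def switched_regression_mean(x, state_mean, weights, means, covs, reg_mats):
    """Acoustic mean of one HMM state: the state's own mean plus an
    articulatory regression term softly switched among GMM components."""
    xi = np.append(x, 1.0)                        # augmented explanatory vector [x; 1]
    post = gmm_posteriors(x, weights, means, covs)
    shift = sum(p * (A @ xi) for p, A in zip(post, reg_mats))
    return state_mean + shift

# Toy usage: 6 articulatory channels (e.g. EMA coil positions) driving a
# 40-dim acoustic mean via 2 GMM components; all parameters are random placeholders.
rng = np.random.default_rng(0)
x = rng.normal(size=6)
weights = np.array([0.5, 0.5])
means = rng.normal(size=(2, 6))
covs = [np.eye(6), np.eye(6)]
reg_mats = rng.normal(size=(2, 40, 7)) * 0.01     # each maps [x; 1] (7-dim) to acoustics
mu = switched_regression_mean(x, np.zeros(40), weights, means, covs, reg_mats)
print(mu.shape)  # (40,)
```

Under this reading, manipulating the articulatory input at synthesis time (e.g. repositioning tongue EMA channels) changes both the component posteriors and the regression shift, which is how articulatory control would propagate to the acoustic parameters.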
Published in: | IEEE transactions on audio, speech, and language processing, 2013-01, Vol.21 (1), p.207-219 |
---|---|
Main Authors: | Zhen-Hua Ling; Richmond, K.; Yamagishi, J. |
Format: | Article |
Language: | English |
Subjects: | Acoustics; Applied sciences; Articulatory features; Context; Exact sciences and technology; Gaussian mixture model; Hidden Markov models; Information, signal and communications theory; Mathematical models; multiple-regression hidden Markov model; Pattern recognition; Regression; Signal processing; Speech; Speech processing; Speech synthesis; Studies; Synthesizers; Tasks; Telecommunications and information theory; Transforms; Vowels |
DOI: | 10.1109/TASL.2012.2215600 |
ISSN: | 1558-7916, 2329-9290 |
EISSN: | 1558-7924, 2329-9304 |
Publisher: | Piscataway, NJ: IEEE |
Source: | IEEE Xplore (Online service) |
Online Access: | https://ieeexplore.ieee.org/document/6289354 |