Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression
In previous work we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesizer. In this method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms are used to model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous "explanatory" variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM, instead of for the context-dependent states in the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between state context features and articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database with acoustic waveforms and articulatory movements recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance when reconstructing target vowels by varying articulatory inputs than our previous approach. A second vowel creation task shows our new method is highly effective at producing a new vowel from appropriate articulatory representations which, even though no acoustic samples for this vowel are present in the training data, is shown to sound highly natural.
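The abstract's core mechanism, regression matrices switched by a GMM over articulatory space rather than tied to HMM states, can be illustrated with a small sketch. Below is a minimal Python rendering of one plausible reading of that scheme: the articulatory vector is augmented with a bias term, the articulatory-space GMM yields component posteriors, and the acoustic state mean is shifted by the posterior-weighted sum of per-component regression predictions. All names, dimensions, and the soft (posterior-weighted) switching are illustrative assumptions, not the paper's exact notation or formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_posteriors(x, weights, means, covs):
    """Posterior probability of each articulatory-space GMM component given x."""
    likelihoods = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    return likelihoods / likelihoods.sum()

def switched_regression_mean(x, state_mean, weights, means, covs, reg_mats):
    """Acoustic mean of one HMM state: the state's own mean plus an
    articulatory regression term softly switched among GMM components."""
    xi = np.append(x, 1.0)                        # augmented explanatory vector [x; 1]
    post = gmm_posteriors(x, weights, means, covs)
    shift = sum(p * (A @ xi) for p, A in zip(post, reg_mats))
    return state_mean + shift

# Toy usage: 6 articulatory channels (e.g. EMA coil positions) driving a
# 40-dim acoustic mean via 2 GMM components; all parameters are random placeholders.
rng = np.random.default_rng(0)
x = rng.normal(size=6)
weights = np.array([0.5, 0.5])
means = rng.normal(size=(2, 6))
covs = [np.eye(6), np.eye(6)]
reg_mats = rng.normal(size=(2, 40, 7)) * 0.01     # each maps [x; 1] (7-dim) to acoustics
mu = switched_regression_mean(x, np.zeros(40), weights, means, covs, reg_mats)
print(mu.shape)  # (40,)
```

Under this reading, manipulating the articulatory input at synthesis time (e.g. repositioning tongue EMA channels) changes both the component posteriors and the regression shift, which is how articulatory control would propagate to the acoustic parameters.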
Published in: | IEEE transactions on audio, speech, and language processing, 2013-01, Vol.21 (1), p.207-219 |
---|---|
Main Authors: | Zhen-Hua Ling; Richmond, K.; Yamagishi, J. |
Format: | Article |
Language: | English |
Subjects: | Acoustics; Applied sciences; Articulatory features; Context; Exact sciences and technology; Gaussian mixture model; Hidden Markov models; Information, signal and communications theory; Mathematical models; multiple-regression hidden Markov model; Pattern recognition; Regression; Signal processing; Speech; Speech processing; Speech synthesis; Studies; Synthesizers; Tasks; Telecommunications and information theory; Transforms; Vowels |
DOI: | 10.1109/TASL.2012.2215600 |
ISSN: | 1558-7916, 2329-9290 |
EISSN: | 1558-7924, 2329-9304 |
Publisher: | Piscataway, NJ: IEEE |
Source: | IEEE Xplore (Online service) |
Online Access: | https://ieeexplore.ieee.org/document/6289354 |