Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech, and Language Processing, 2013-01, Vol. 21 (1), p. 207-219
Main Authors: Zhen-Hua Ling; Richmond, K.; Yamagishi, J.
Format: Article
Language: English
DOI: 10.1109/TASL.2012.2215600
ISSN: 1558-7916; 2329-9290
EISSN: 1558-7924; 2329-9304
Publisher: Piscataway, NJ: IEEE
Source: IEEE Xplore (Online service)
Subjects: Acoustics; Applied sciences; Articulatory features; Context; Exact sciences and technology; Gaussian mixture model; Hidden Markov models; Information, signal and communications theory; Mathematical models; multiple-regression hidden Markov model; Pattern recognition; Regression; Signal processing; Speech; Speech processing; Speech synthesis; Studies; Synthesizers; Tasks; Telecommunications and information theory; Transforms; Vowels
Abstract:
In previous work we proposed a method to flexibly control the characteristics of synthetic speech by integrating articulatory features into a hidden Markov model (HMM) based parametric speech synthesizer. In that method, a unified acoustic-articulatory model is trained, and context-dependent linear transforms model the dependency between the two feature streams. In this paper, we go significantly further and propose a feature-space-switched multiple regression HMM to improve the performance of articulatory control. A multiple regression HMM (MRHMM) is adopted to model the distribution of acoustic features, with articulatory features used as exogenous "explanatory" variables. A separate Gaussian mixture model (GMM) is introduced to model the articulatory space, and articulatory-to-acoustic regression matrices are trained for each component of this GMM rather than for the context-dependent states of the HMM. Furthermore, we propose a task-specific context feature tailoring method to ensure compatibility between state context features and the articulatory features that are manipulated at synthesis time. The proposed method is evaluated on two tasks, using a speech database with acoustic waveforms and articulatory movements recorded in parallel by electromagnetic articulography (EMA). In a vowel identity modification task, the new method achieves better performance than our previous approach when reconstructing target vowels by varying the articulatory inputs. A second vowel creation task shows that the new method is highly effective at producing a new vowel from appropriate articulatory representations; even though no acoustic samples of this vowel are present in the training data, the result is shown to sound highly natural.
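
The switching mechanism described in the abstract can be sketched numerically. The NumPy fragment below is a minimal illustration, not the authors' implementation: it assumes the acoustic mean for an HMM state is the state mean shifted by a regression term, mu_q + sum_m gamma_m(x) * A_m * xi, where xi = [x; 1] is the augmented articulatory vector and gamma_m(x) is the posterior of component m of the articulatory-space GMM (the paper may instead select a single best component rather than blending them). All names here (gmm_posteriors, switched_regression_mean, reg_matrices) are hypothetical.

import numpy as np

def gmm_posteriors(x, weights, means, covs):
    # Posterior responsibility of each articulatory-space GMM component
    # given the articulatory vector x; weights: (M,), means: (M, D),
    # covs: (M, D, D). Computed in the log domain for numerical stability.
    M, D = means.shape
    log_p = np.empty(M)
    for m in range(M):
        diff = x - means[m]
        _, logdet = np.linalg.slogdet(covs[m])
        maha = diff @ np.linalg.solve(covs[m], diff)
        log_p[m] = np.log(weights[m]) - 0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()

def switched_regression_mean(x, state_mean, reg_matrices, weights, means, covs):
    # Mean of one state's acoustic distribution: the state mean shifted by
    # a regression on the augmented articulatory vector [x; 1], where the
    # per-component regression matrices A_m (one per GMM component, shape
    # (D_acoustic, D_articulatory + 1)) are blended by the GMM posteriors.
    xi = np.append(x, 1.0)
    gamma = gmm_posteriors(x, weights, means, covs)
    shift = sum(g * (A @ xi) for g, A in zip(gamma, reg_matrices))
    return state_mean + shift

Under this reading, tying the regression matrices to articulatory-space GMM components rather than HMM states is what makes articulatory control work at synthesis time: replacing x with manipulated articulatory trajectories (for example, EMA-style tongue positions shifted toward a target vowel) changes both the posteriors and the regression shift, steering the acoustic output without retraining the HMM.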