Loading…

Improved Prosody Generation by Maximizing Joint Probability of State and Longer Units

The current state-of-the-art hidden Markov model (HMM)-based text-to-speech (TTS) can produce highly intelligible, synthesized speech with decent segmental quality. However, its prosody, especially at phrase or sentence level, still tends to be bland. This blandness is partially due to the fact that...

Full description

Saved in:
Bibliographic Details
Published in:IEEE transactions on audio, speech, and language processing speech, and language processing, 2011-08, Vol.19 (6), p.1702-1710
Main Authors: Yao Qian, Zhizheng Wu, Boyang Gao, Soong, F K
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The current state-of-the-art hidden Markov model (HMM)-based text-to-speech (TTS) can produce highly intelligible, synthesized speech with decent segmental quality. However, its prosody, especially at phrase or sentence level, still tends to be bland. This blandness is partially due to the fact that the state-based HMM is inadequate in capturing global, hierarchical suprasegmental information in speech signals. In this paper, to improve the TTS prosody, longer units are first explicitly modeled with appropriate parametric distributions. The resultant models are then integrated with the state-based baseline models in generating better prosody by maximizing the joint probability. Experimental results in both Mandarin and English show consistent improvements over our baseline system with only state-based prosody model. The improvements are both objectively measurable and subjectively perceivable.
ISSN:1558-7916
2329-9290
1558-7924
2329-9304
DOI:10.1109/TASL.2010.2097248