Selection of the most significant parameters for duration modelling in a Spanish text-to-speech system using neural networks
Published in: | Computer speech & language 2002-04, Vol.16 (2), p.183-203 |
---|---|
Format: | Article |
Language: | English |
Summary: | Accurate prediction of segmental duration from text in a text-to-speech system is difficult for several reasons. An especially relevant one is the large number of contextual factors that affect timing, which makes it difficult to find the right way to model them. Many parameters affect duration, but not all of them are always relevant, and some can even be counterproductive because they increase the risk of overtraining.
The main motivation of this paper has been to reduce the error in the duration estimation.
To this end, it is of the utmost importance to find the factors that most influence duration in a given language. The approach we have taken is to use a fully configurable neural network and to experiment with different combinations of parameters in order to find those that yield the minimum estimation error.
We have oriented our work mainly towards two aspects: the most significant parameters that can be used as input to the automatic model, and the best way to code these parameters. We first studied the effect of each parameter alone and then included all parameters together to obtain our final system.
Another important aspect of this study is the generation of a suite of software tools and design protocols that will be used in future tasks with different speakers and databases. The applications for automatic modelling are obvious: adapting the prosody to a new speaker, to a new environment, to "restricted-domain" sentences, etc., in a fast, semi-automatic and inexpensive way. Once the database has been labelled, it is a matter of minutes to prepare the inputs to the network for the new situation, and the network is trained in 1 h.
The result is a system that predicts duration with very good results (an RMS error of 19 ms) and that clearly improves on our previous rule-based system. |
---|---|
ISSN: | 0885-2308 1095-8363 |
DOI: | 10.1006/csla.2002.0190 |
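
The summary describes studying each parameter alone and comparing RMS errors. The following is a minimal sketch of that kind of comparison, not the authors' method: it assumes a simple per-category mean predictor as a stand-in for the neural network, measures error on the same data it was fitted to, and uses invented toy durations. The names `rms_error`, `one_hot`, `category_mean_predictions` and `rank_parameters` are hypothetical, and `one_hot` shows only one possible way to code a categorical parameter.

```python
import math
from collections import defaultdict


def rms_error(predicted_ms, observed_ms):
    """Root-mean-square error between predicted and observed durations (ms)."""
    if len(predicted_ms) != len(observed_ms) or not predicted_ms:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted_ms, observed_ms)]
    return math.sqrt(sum(squared) / len(squared))


def one_hot(value, categories):
    """Code a categorical parameter as a 0/1 vector, one slot per category."""
    return [1.0 if value == c else 0.0 for c in categories]


def category_mean_predictions(values, observed_ms):
    """Predict each segment's duration as the mean duration of its category.

    Assumption: this trivial predictor replaces the neural network so that
    the sketch stays self-contained.
    """
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for value, duration in zip(values, observed_ms):
        totals[value][0] += duration
        totals[value][1] += 1
    return [totals[v][0] / totals[v][1] for v in values]


def rank_parameters(candidates, observed_ms):
    """Score each candidate parameter alone; lowest RMS error comes first."""
    scores = {
        name: rms_error(category_mean_predictions(values, observed_ms), observed_ms)
        for name, values in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Invented toy data: four segments with two hypothetical parameters.
durations_ms = [60.0, 80.0, 100.0, 120.0]
candidates = {
    "stress": ["unstressed", "unstressed", "stressed", "stressed"],
    "position": ["initial", "final", "initial", "final"],
}
ranking = rank_parameters(candidates, durations_ms)
```

On this toy data `ranking` places `stress` ahead of `position`, because grouping by stress leaves a smaller residual error. A real experiment would score each parameter on held-out data, since training-set error does not reveal the overtraining the summary warns about.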