Speaker adaptation based on transfer vector field smoothing using maximum a posteriori probability estimation

Bibliographic Details
Published in: Computer Speech & Language, 1996-04, Vol. 10 (2), p. 117-132
Main Authors: Tonomura, Masahiro; Kosaka, Tetsuo; Matsunaga, Shoichi
Format: Article
Language: English
Description
Summary: This paper proposes a novel speaker adaptation algorithm that enables adaptation with a small amount of speech data. The algorithm consists of two blocks: a parameter adaptation algorithm that uses the information in a well-trained initial model, and an initial model generation algorithm based on speaker clustering. The parameter adaptation algorithm combines two speaker adaptation techniques, maximum a posteriori (MAP) estimation and transfer vector field smoothing (VFS). This MAP–VFS algorithm unifies the two techniques to avoid the weaknesses each has when used alone, and it can interpolate and smooth untrained or insufficiently trained parameters by taking the reliability of each estimated parameter into account. It achieved higher phoneme recognition performance than MAP or VFS used individually: starting from a speaker-independent model and using a total of 6 s of adaptation speech, the phoneme recognition error rate fell from 22·0% to 19·1%. The initial model generation algorithm was then added to give the MAP–VFS algorithm a more effective initial model. It generates the initial model from the speech of a speaker cluster selected by speaker similarity, which provides a priori knowledge about the characteristics of the target speaker. Adaptation from this initial model reduced the phoneme recognition error rate from 22·0% to 17·7%, showing that speaker similarity information is effective as a priori information.
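
The summary does not give the update equations, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a standard MAP update for Gaussian means with a prior weight `tau`, frames already hard-assigned to each mean, and a smoothing step in which every mean is shifted by a weighted average of the transfer vectors of its nearest adapted neighbours. Using the frame count as the "reliability" weight, and the names `map_mean`, `smoothed_transfer`, `map_vfs`, `tau`, `k` and `scale`, are all assumptions made for illustration.

```python
import math


def map_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean; tau is an assumed prior weight."""
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def smoothed_transfer(target, prior_means, transfers, counts, k=2, scale=1.0):
    """Weighted average of the transfer vectors of the k nearest adapted means.

    Weights combine closeness in the initial model space with the frame
    count, used here as a stand-in for the reliability of each estimate.
    """
    dim = len(target)
    adapted = [i for i, n in enumerate(counts) if n > 0]
    nearest = sorted(adapted, key=lambda i: math.dist(target, prior_means[i]))[:k]
    if not nearest:
        return [0.0] * dim  # no adaptation data at all: leave the mean unchanged
    weights = [
        counts[i] * math.exp(-math.dist(target, prior_means[i]) / scale)
        for i in nearest
    ]
    total = sum(weights)
    return [
        sum(w * transfers[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]


def map_vfs(prior_means, frames_per_mean, tau=10.0, k=2, scale=1.0):
    """Adapt all means: MAP where data exists, then smooth/interpolate shifts."""
    counts = [len(f) for f in frames_per_mean]
    # Transfer vector = MAP-adapted mean minus initial mean (zero if no data).
    transfers = [
        [a - b for a, b in zip(map_mean(m, f, tau), m)] if f else [0.0] * len(m)
        for m, f in zip(prior_means, frames_per_mean)
    ]
    adapted = []
    for m in prior_means:
        shift = smoothed_transfer(m, prior_means, transfers, counts, k, scale)
        adapted.append([m_d + s_d for m_d, s_d in zip(m, shift)])
    return adapted


# Hypothetical toy data: two means see adaptation frames, the third sees none.
prior = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
frames = [[[1.0, 1.0]] * 10, [[2.0, 1.0]] * 10, []]
result = map_vfs(prior, frames)
```

In this toy case the third mean has no adaptation data, so its shift is interpolated entirely from the two adapted means; this is the role the summary attributes to the smoothing step for untrained parameters.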
ISSN: 0885-2308, 1095-8363
DOI: 10.1006/csla.1996.0008