Loading…

A Two-Pass Framework of Mispronunciation Detection and Diagnosis for Computer-Aided Pronunciation Training

This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnoses (MD&D). The first pass of mispronunciation detection does not require explicit phonetic error pattern modeling. The framework instantiates a set of antiphones and a filler...

Full description

Saved in:

Bibliographic Details
Published in:	IEEE/ACM transactions on audio, speech, and language processing speech, and language processing, 2016-06, Vol.24 (6), p.1020-1028
Main Authors:	Xiaojun Qian, Meng, Helen, Soong, Frank
Format:	Article
Language:	English
Subjects:	Acoustics computer-aided pronunciation training Diagnosis discriminative training Error detection Feature extraction Hidden Markov models Insertion mispronunciation detection and diagnosis Permissible error Phonetics Speech Speech processing Telephones Training
Citations:	Items that this one cites Items that cite this one
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnoses (MD&D). The first pass of mispronunciation detection does not require explicit phonetic error pattern modeling. The framework instantiates a set of antiphones and a filler model to augment the original phone model for each canonical phone. This guarantees full coverage of all possible error patterns while maximally exploiting the phonetic information derived from the text prompt. The antiphones can be used to detect substitutions. The filler model can detect insertions, and phone skips are allowed to detect deletions. As such, there is no prior assumption on the possible error patterns that can occur. The second pass of mispronunciation diagnosis expands the detected insertions and substitutions into phone networks, and another recognition pass attempts to reveal the phonetic identities of the detected mispronunciation errors. Discriminative training (DT) is applied respectively to the acoustic models of the mispronunciation detection pass and the mispronunciation diagnosis pass. DT effectively separates the acoustic models of the canonical phones and the antiphones. Overall, with DT in both passes of MD&D, the error rate is reduced by 40.4% relative, compared with the maximum likelihood baseline. After DT, the error rates of the respective passes are also lower than those of a strong single-pass baseline with DT by 1.3% and 5.1% relative which are statistically significant.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2016.2526782