Speech recognition in a dialog system: from conventional to deep processing: A case study applied to Spanish


Bibliographic Details
Published in: Multimedia Tools and Applications, 2018-06, Vol. 77 (12), p. 15875-15911
Main Authors: Becerra, Aldonso; de la Rosa, J. Ismael; González, Efrén
Format: Article
Language: English
Description
Summary: This paper gives an overview of the automatic speech recognition (ASR) module in a spoken dialog system and of how it has evolved from the conventional GMM-HMM (Gaussian mixture model - hidden Markov model) architecture toward the more recent nonlinear DNN-HMM (deep neural network - hidden Markov model) scheme. GMMs were long the baseline for speech recognition, but in recent years, with the resurgence of artificial neural networks (ANNs), they have been surpassed in most recognition tasks. A notable property of ANN-based acoustic models is that their weights can be adjusted in two training steps: (i) initialization of the weights (with or without pre-training) and (ii) fine-tuning. To exemplify these frameworks, a case study is carried out with the Kaldi toolkit, using a mid-sized vocabulary and a custom speaker-independent voice corpus in a connected-words phone-dialing environment for recognizing digit strings and personal name lists in Mexican Spanish. The results show reasonable recognition accuracy with DNN acoustic modeling. The context-dependent DNN-HMM achieves a word error rate (WER) of 1.49 %, a relative improvement of about 30 % over the best GMM-HMM result in these experiments (2.12 % WER).
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-017-5160-5
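
Note on the reported figures: the summary quotes word error rates and a relative improvement without showing how they relate. The sketch below is one common way to compute both, assuming WER is the word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `relative_improvement` and the sample digit strings are illustrative assumptions, not taken from the paper or from Kaldi.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical digit string with one substituted word out of four.
    print(word_error_rate("cuatro nueve dos uno", "cuatro nueve tres uno"))
    # WER values quoted in the summary: best GMM-HMM vs. DNN-HMM.
    print(round(relative_improvement(2.12, 1.49), 1))
```

With the values from the summary, (2.12 - 1.49) / 2.12 is roughly 0.297, which matches the stated relative improvement of about 30 %.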