Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition
Published in: | Digital signal processing 2018-08, Vol.79, p.142-151 |
---|---|
Main Authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | In automatic speech recognition (ASR) systems, the front-end acoustic features should not be affected by signal periodicity (the pitch period). Motivated by this fact, we study the role of a pitch-synchronous spectrum estimation approach, referred to as TANDEM STRAIGHT, in this paper. TANDEM STRAIGHT yields a smoother spectrum that is largely free of pitch harmonics. Consequently, the acoustic features derived from the smoothed spectra outperform the conventional Mel-frequency cepstral coefficients (MFCC). The experimental evaluations reported in this paper are performed on speech data from a wide range of speakers belonging to different age groups, including children. The proposed features are found to be effective for all groups of speakers. To further improve the recognition of children's speech, the effect of vocal-tract length normalization (VTLN) is studied, and its inclusion further improves recognition performance. We have also performed a detailed study of the effect of speaking-rate normalization (SRN) in the context of children's speech recognition. An SRN technique based on the anchoring of glottal closure instants, estimated using zero-frequency filtering, is explored in this regard. SRN is observed to be highly effective for child speakers belonging to different age groups. Finally, all the studied techniques are combined for effective mismatch reduction. On the children's speech test set, the proposed features give a relative improvement of 21.6% over the MFCC features even after VTLN and SRN are combined. |
---|---|
ISSN: | 1051-2004 1095-4333 |
DOI: | 10.1016/j.dsp.2018.05.003 |
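
The summary reports a relative improvement of 21.6% over MFCC features. As a reading aid, the sketch below shows how such a figure is conventionally computed, assuming the underlying metric is an error rate such as word error rate (the record does not name the metric). The function name `relative_improvement` and the two error rates in the usage lines are hypothetical values chosen only to reproduce the percentage; they are not results from the paper.

```python
def relative_improvement(baseline_error: float, proposed_error: float) -> float:
    """Return the relative reduction in error rate, in percent.

    Assumes both arguments are error rates on the same scale
    (for example, word error rate in percent), where lower is better.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error * 100.0


# Hypothetical error rates, NOT taken from the paper: a baseline of 20.0%
# and a proposed-system rate of 15.68% correspond to a 21.6% relative gain.
gain = relative_improvement(20.0, 15.68)
print(f"relative improvement: {gain:.1f}%")
```

A relative figure of this kind depends on the baseline: the same absolute reduction counts for more when the baseline error rate is already low.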