Creating Robust Children’s ASR System in Zero-Resource Condition Through Out-of-Domain Data Augmentation
Published in: | Circuits, systems, and signal processing, 2022-04, Vol.41 (4), p.2205-2220 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging because data from the child domain is unavailable for the majority of languages. Consequently, in such zero-resource scenarios, we are forced to develop an ASR system using adults’ speech for transcribing data from child speakers. However, the acoustic mismatch due to differences in formant frequencies and speaking rate between the two groups of speakers results in poor recognition rates, as reported in earlier works. To reduce this mismatch, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in this work. For that purpose, the formant frequencies of adults’ speech data are up-scaled by warping the linear predictive coding coefficients. Next, the speaking rate of adults’ speech data is decreased through time-scale modification. Simultaneously altering the formant frequencies and duration of adults’ speech, and then pooling the modified data into training, reduces the ill effects of the acoustic mismatch caused by these factors. This, in turn, enhances the recognition performance significantly. Additional improvement in recognition rate is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed approach. As demonstrated by the experimental evaluations presented in this paper, compared to an ASR system trained on adult data, a relative reduction of 37.6% in word error rate is achieved through data augmentation. Furthermore, the proposed approach yields large reductions in word error rates even under noisy test conditions. |
---|---|
ISSN: | 0278-081X 1531-5878 |
DOI: | 10.1007/s00034-021-01885-5 |