Creating Robust Children’s ASR System in Zero-Resource Condition Through Out-of-Domain Data Augmentation

Bibliographic Details
Published in:	Circuits, systems, and signal processing, 2022-04, Vol.41 (4), p.2205-2220
Main Authors:	Kumar, Vinit; Kumar, Avinash; Shahnawazuddin, S.
Format:	Article
Language:	English
Subjects:	Acoustics; Adults; Automatic speech recognition; Children; Circuits and Systems; Data augmentation; Domains; Electrical Engineering; Electronics and Microelectronics; Engineering; Error analysis; Instrumentation; Signal, Image and Speech Processing; Speaking; Speech; Speech duration; Speech rate; Speech recognition; Voice recognition; Words (language)

Description
Summary:	Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging because child-domain data are unavailable for the majority of languages. Consequently, in such zero-resource scenarios, an ASR system trained on adults’ speech must be used to transcribe data from child speakers. However, the acoustic mismatch between the two groups of speakers, caused by differences in formant frequencies and speaking rate, results in poor recognition rates, as reported in earlier works. To reduce this mismatch, an out-of-domain data augmentation approach based on formant and time-scale modification is proposed in this work. First, the formant frequencies of adults’ speech data are up-scaled by warping the linear predictive coding coefficients. Next, the speaking rate of adults’ speech data is decreased through time-scale modification. Because both the formant frequencies and the duration of adults’ speech are altered and the modified data are then pooled into training, the ill effects of the acoustic mismatch are reduced. This, in turn, enhances the recognition performance significantly. Additional improvement in recognition rate is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed approach. As demonstrated by the experimental evaluations presented in this paper, a relative reduction of 37.6% in word error rate is achieved through data augmentation, compared to an ASR system trained on adult data alone. Furthermore, the proposed approach yields large reductions in word error rates even under noisy test conditions.
ISSN:	0278-081X 1531-5878
DOI:	10.1007/s00034-021-01885-5
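
Illustrative note: the summary describes slowing down adults' speech through time-scale modification and pooling the modified data into training. The Python sketch below shows that idea in a minimal form. It is not the authors' implementation: the record does not say which time-scale modification algorithm the paper uses, so a naive overlap-add (OLA) stretch stands in for it here, and the formant up-scaling step (warping of linear predictive coding coefficients) is not shown. The function names, the frame length of 512 samples, and the rate of 0.8 are all assumptions chosen for illustration.

```python
import numpy as np


def ola_time_stretch(x, rate, frame_len=512):
    """Naive overlap-add time-scale modification (illustrative stand-in).

    rate < 1 slows the signal down (longer output); rate > 1 speeds it up.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    x = np.asarray(x, dtype=float)
    hop_out = frame_len // 2                     # synthesis hop (50% overlap)
    hop_in = max(1, int(round(hop_out * rate)))  # analysis hop
    window = np.hanning(frame_len)
    n_frames = max(1, 1 + (len(x) - frame_len) // hop_in)
    out_len = (n_frames - 1) * hop_out + frame_len
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        frame = x[i * hop_in : i * hop_in + frame_len]
        if len(frame) < frame_len:               # zero-pad a short final frame
            frame = np.pad(frame, (0, frame_len - len(frame)))
        start = i * hop_out
        out[start : start + frame_len] += frame * window
        norm[start : start + frame_len] += window
    norm[norm < 1e-8] = 1.0                      # avoid division by zero at the edges
    return out / norm


def pool_with_slowed_copies(utterances, rate=0.8):
    """Return the original (signal, transcript) pairs plus slowed-down copies.

    The transcript is unchanged because only the speaking rate is modified.
    """
    pooled = list(utterances)
    for signal, transcript in utterances:
        pooled.append((ola_time_stretch(signal, rate), transcript))
    return pooled


# Hypothetical usage: one second of synthetic audio at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
adult_utterance = np.sin(2 * np.pi * 220.0 * t)
training_pool = pool_with_slowed_copies([(adult_utterance, "hello world")], rate=0.8)
```

Plain OLA does not align frames by waveform similarity, so it can introduce audible artifacts on real speech; a similarity-based variant would likely be preferred in practice. The sketch is meant only to show how a lower speaking rate lengthens each utterance before the modified copies join the training pool.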