Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers
Published in: | Circuits, Systems, and Signal Processing, 2019-10, Vol.38 (10), p.4667-4682 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | In the context of automatic speech recognition (ASR), the power spectrum is generally warped to the Mel-scale during front-end speech parameterization. This is motivated by the fact that human perception of sound is nonlinear. The Mel-filterbank provides better resolution for low-frequency contents, while a greater degree of averaging happens in the high-frequency range. The work presented in this paper aims at studying the role of linear, Mel and inverse-Mel-filterbanks in the context of ASR. When speech data are from high-pitched speakers such as children, there is a significant amount of relevant information in the high-frequency region. Hence, down-sampling the information in that range through the Mel-filterbank reduces the recognition performance. On the other hand, employing inverse-Mel or linear filterbanks is expected to be more effective in such cases. The same has been experimentally validated in this work. For that purpose, an ASR system is developed on adults’ speech and tested using data from adult as well as child speakers. Significantly improved recognition rates are noted for children’s as well as adult females’ speech when a linear or inverse-Mel-filterbank is used. The use of linear filters results in a relative improvement of 21% over the baseline. To further boost the performance, vocal-tract length normalization, explicit pitch scaling and pitch-adaptive spectral estimation are also explored on top of the linear filterbank. |
---|---|
ISSN: | 0278-081X 1531-5878 |
DOI: | 10.1007/s00034-019-01072-7 |
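
Note: the resolution trade-off described in the summary can be illustrated by comparing where the filter centre frequencies fall under each warping. The sketch below is not taken from the article. It assumes the widely used Mel formula m = 2595 log10(1 + f/700), a hypothetical setup of 20 filters over 0 to 8000 Hz, and one common construction of the inverse-Mel scale (mirroring the Mel centres about the middle of the band); the article's exact definitions may differ.

```python
import math

def hz_to_mel(f_hz):
    # Common Mel-scale formula (an assumption; the article may use a variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def linear_centers(n_filters, f_max_hz):
    # Centres equally spaced in Hz between 0 and f_max_hz (endpoints excluded).
    step = f_max_hz / (n_filters + 1)
    return [step * (i + 1) for i in range(n_filters)]

def mel_centers(n_filters, f_max_hz):
    # Centres equally spaced on the Mel axis, mapped back to Hz.
    mel_step = hz_to_mel(f_max_hz) / (n_filters + 1)
    return [mel_to_hz(mel_step * (i + 1)) for i in range(n_filters)]

def inverse_mel_centers(n_filters, f_max_hz):
    # Assumed construction: mirror the Mel centres about f_max_hz / 2,
    # so the dense region moves from low to high frequencies.
    return [f_max_hz - c for c in reversed(mel_centers(n_filters, f_max_hz))]

def count_above(centers, cutoff_hz):
    # Number of filters whose centre lies above cutoff_hz.
    return sum(1 for c in centers if c > cutoff_hz)

if __name__ == "__main__":
    n, f_max = 20, 8000.0  # hypothetical values chosen for illustration
    for name, fn in [("mel", mel_centers),
                     ("linear", linear_centers),
                     ("inverse-mel", inverse_mel_centers)]:
        centers = fn(n, f_max)
        print(name, "filters centred above 4 kHz:", count_above(centers, 4000.0))
```

Under these assumptions the Mel scale places the fewest filters in the upper half of the band and the inverse-Mel scale the most, which matches the summary's argument that Mel warping averages away high-frequency detail.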