Improving the Performance of ASR System by Building Acoustic Models using Spectro-Temporal and Phase-Based Features

Bibliographic Details
Published in:Circuits, Systems, and Signal Processing, 2022-03, Vol.41 (3), p.1609-1632
Main Authors: Dutta, Anirban, Ashishkumar, G., Rao, Ch. V. Rama
Format: Article
Language:English
Description
Summary:State-of-the-art spectral or temporal features of speech do not provide adequate attributes for automatic speech recognition (ASR) systems in noisy environments. Phase-based speech processing has recently gained importance in the speech community. Phase-based features are as important as magnitude-based features and, if incorporated suitably, can provide vital acoustic information. This work investigates whether phase features provide information complementary to spectro-temporal features and thereby enhance the performance of an ASR system. Different phase extraction approaches are analysed to identify which representation gives the best performance for the hybrid ASR system. The study further addresses the use of phase information together with spectro-temporal features in building an acoustic model to improve ASR performance. Gammatonegram-based Gabor filters are used to extract the spectro-temporal features from the speech utterances. The combined features appear to offer better and more discriminative attributes. Experiments analyse the performance of the ASR system with the combined feature set on the Aurora2 database and on speech utterances from TIMIT corrupted with different noise sources at various SNR values. For the TIMIT database, the results show average relative improvements of 18.2%, 20.1% and 4.7% over MFCC, RASTA-PLP and spectro-temporal features, respectively. For the Aurora2 database, average relative improvements of 6.2% with clean training and 6.1% with multi-condition training are obtained over the baseline spectro-temporal features.
ISSN:0278-081X, 1531-5878
DOI:10.1007/s00034-021-01848-w
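
The summary mentions TIMIT utterances corrupted with noise at various SNR values. As a rough illustration of what that kind of corruption usually involves, the sketch below scales a noise signal so that the mixture reaches a requested SNR. It is not taken from the article: the function name mix_at_snr is hypothetical, and the power-ratio definition of SNR and the tiling of short noise clips are assumptions.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration only. Assumes SNR is defined as
    10 * log10(mean speech power / mean noise power) over the whole utterance.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must both have non-zero power")

    # Noise power needed for the target SNR, and the gain that achieves it.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


# Usage with synthetic signals standing in for an utterance and a noise clip.
rng = np.random.default_rng(0)
utterance = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise_clip = rng.standard_normal(4000)
noisy_utterance = mix_at_snr(utterance, noise_clip, snr_db=10.0)
```

Measuring power over the whole utterance keeps the sketch short; a setup that measures power only over active speech segments would give a different scaling for the same nominal SNR.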