
Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition

Bibliographic Details
Published in: Speech Communication 2004-06, Vol.43 (1), p.123-142
Main Authors: Palomäki, Kalle J; Brown, Guy J; Barker, Jon P
Format: Article
Language: English
Description
Summary: In this study we describe two techniques for handling convolutional distortion with ‘missing data’ speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach with spectrally distorted and noise-contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify ‘reliable’ regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram.
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2004.02.005
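
Illustrative note: the summary states that the normalising factor is computed only from the most intense regions of the speech spectrum. The record gives no further detail, so the following is only a minimal sketch of that general idea, not the authors' algorithm. It assumes a spectrogram of non-negative channel energies, and the function name `normalise_by_peaks` and the parameter `top_fraction` are hypothetical.

```python
def normalise_by_peaks(spectrogram, top_fraction=0.1):
    """Scale each frequency channel by the mean of its most intense frames.

    spectrogram:  list of frames, each a list of non-negative channel energies.
    top_fraction: hypothetical parameter (not from the paper): the share of
                  frames per channel treated as the 'most intense' region.
    Returns (normalised_spectrogram, per_channel_factors).
    """
    n_frames = len(spectrogram)
    n_channels = len(spectrogram[0])
    n_top = max(1, int(round(top_fraction * n_frames)))

    factors = []
    for ch in range(n_channels):
        # Sort this channel's energies so the strongest frames come first.
        values = sorted((frame[ch] for frame in spectrogram), reverse=True)
        peak_mean = sum(values[:n_top]) / n_top
        # Guard against an all-zero channel.
        factors.append(peak_mean if peak_mean > 0 else 1.0)

    normalised = [
        [frame[ch] / factors[ch] for ch in range(n_channels)]
        for frame in spectrogram
    ]
    return normalised, factors


# A fixed per-channel gain (a simple stand-in for spectral distortion or a
# change of input level) scales the factor by the same amount, so the
# normalised values are unchanged.
clean = [[1.0, 2.0], [4.0, 8.0], [2.0, 1.0], [0.5, 0.5]]
gains = [4.0, 0.5]
distorted = [[v * g for v, g in zip(frame, gains)] for frame in clean]

norm_clean, _ = normalise_by_peaks(clean, top_fraction=0.25)
norm_distorted, _ = normalise_by_peaks(distorted, top_fraction=0.25)
```

Because the factor is taken from the strongest frames only, low-energy frames that additive noise would dominate do not influence it, which is the motivation the summary gives for this choice.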