
Channel selection based on multichannel cross-correlation coefficients for distant speech recognition

Bibliographic Details
Main Authors: Kumatani, K., McDonough, J., Lehman, J. F., Raj, B.
Format: Conference Proceeding
Language:English
Description
Summary: In theory, beamforming performance can be improved by using as many microphones as possible, but in practice it has been shown that using all possible channels does not always improve speech recognition performance. In this work, we present a new channel selection method in order to increase the computational efficiency of beamforming for distant speech recognition (DSR) without sacrificing performance. To achieve better performance, we treat a channel that is uncorrelated with the others as unreliable and choose a subset of microphones whose signals are most highly correlated with each other. We use the multichannel cross-correlation coefficient (MCCC) as a measure for selecting the reliable channels. The selected channels are then used for beamforming. We evaluate our channel selection technique with DSR experiments on real children's speech data captured using a linear array with 64 microphones. A single distant microphone provided a word error rate (WER) of 15.4%, which was reduced to 8.5% by super-directive beamforming with all the sensors. The experimental results suggest that almost the same recognition performance can be obtained with half the number of sensors in the case of super-directive beamforming. Maximum kurtosis beamforming with 48 of the 64 sensors achieved a WER of 5.7%, which is comparable to the 5.2% WER obtained with a close-talking microphone.
DOI:10.1109/HSCMA.2011.5942398
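
Illustrative sketch: the summary describes keeping the subset of microphones whose signals are most highly correlated, scored by the MCCC. The Python below is a minimal sketch of that idea, not the authors' implementation. It assumes the common definition of the squared MCCC as one minus the determinant of the normalized correlation-coefficient matrix, assumes the channels are already time-aligned, and uses a greedy backward elimination as the search strategy. The function names `mccc` and `select_channels`, the search strategy, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def mccc(signals):
    """Squared MCCC of time-aligned channels (one channel per row).

    Assumed definition: 1 - det(R), where R is the matrix of pairwise
    correlation coefficients between channels.
    """
    r = np.corrcoef(signals)
    # Clip to [0, 1] to guard against small numerical errors in det().
    return float(np.clip(1.0 - np.linalg.det(r), 0.0, 1.0))


def select_channels(signals, n_keep):
    """Greedy backward elimination (an assumed search strategy).

    Repeatedly drops the channel whose removal leaves the remaining
    set with the highest MCCC, until n_keep channels remain.
    """
    keep = list(range(signals.shape[0]))
    while len(keep) > n_keep:
        scores = [mccc(signals[[c for c in keep if c != k]]) for k in keep]
        keep.pop(int(np.argmax(scores)))
    return keep


# Synthetic example: five channels observe the same source with mild
# sensor noise, then channel 3 is replaced by unrelated noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(4000)
signals = np.stack([source + 0.1 * rng.standard_normal(4000) for _ in range(5)])
signals[3] = rng.standard_normal(4000)

selected = select_channels(signals, n_keep=4)
print(selected)
```

In this toy setup the uncorrelated channel is the one discarded. For an array as large as the 64-microphone one in the summary, the determinant of a correlation matrix of highly correlated channels can become vanishingly small, so a log-determinant (for example via `numpy.linalg.slogdet`) would likely be a more robust score; that variation is left out here for brevity.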