Exploring Monaural Features for Classification-Based Speech Segregation

Bibliographic Details
Published in:	IEEE transactions on audio, speech, and language processing, 2013-02, Vol.21 (2), p.270-279
Main Authors:	Wang, Yuxuan; Han, Kun; Wang, DeLiang
Format:	Article
Language:	English
Subjects:	Applied sciences; Binary classification; Classification; computational auditory scene analysis (CASA); Exact sciences and technology; feature combination; Feature extraction; group Lasso; Information, signal and communications theory; Linear prediction; Mel frequency cepstral coefficient; Modulation, demodulation; monaural speech segregation; Natural language processing; Scene analysis; Segregations; Signal and communications theory; Signal representation. Spectral analysis; Signal to noise ratio; Signal, noise; Spectrograms; Speech; Studies; Telecommunications and information theory; Training; Transaction processing

Description
Summary:	Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
ISSN:	1558-7916 2329-9290 1558-7924 2329-9304
DOI:	10.1109/TASL.2012.2221459
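
Illustration:	The summary mentions a group Lasso approach for selecting complementary features. The record does not give the authors' formulation, so the following is only a minimal sketch of the general technique, under stated assumptions: a least-squares loss stands in for the classifier, the data are synthetic, and the group names and sizes (`GFCC`, `PITCH`, `OTHER_A`, `OTHER_B`) are hypothetical placeholders rather than the paper's feature dimensions. Each group's weights are shrunk together by block soft-thresholding, so an uninformative feature group is dropped as a whole.

```python
import numpy as np


def group_soft_threshold(w, threshold):
    """Proximal operator of threshold * ||w||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(w)
    if norm <= threshold:
        return np.zeros_like(w)
    return (1.0 - threshold / norm) * w


def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimise (1/2n)||y - Xw||^2 + lam * sum_g sqrt(|g|) * ||w_g||_2
    with proximal gradient descent (illustrative, not the paper's solver)."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size = 1 / Lipschitz constant of the gradient of the smooth term.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        z = w - step * grad
        for idx in groups.values():
            z[idx] = group_soft_threshold(z[idx], step * lam * np.sqrt(len(idx)))
        w = z
    return w


# Hypothetical feature groups: names and sizes are assumptions for this sketch.
groups = {
    "GFCC": np.arange(0, 4),
    "PITCH": np.arange(4, 7),
    "OTHER_A": np.arange(7, 11),
    "OTHER_B": np.arange(11, 14),
}

# Synthetic data in which only the first two groups carry signal.
rng = np.random.default_rng(0)
n_samples = 200
X = rng.standard_normal((n_samples, 14))
w_true = np.zeros(14)
w_true[groups["GFCC"]] = [1.5, -2.0, 1.0, 0.5]
w_true[groups["PITCH"]] = [2.0, -1.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

w_hat = group_lasso(X, y, groups, lam=0.2)

# A group counts as selected if any of its weights survived the thresholding.
selected = [name for name, idx in groups.items() if np.linalg.norm(w_hat[idx]) > 0]
print(selected)
```

The regularisation weight `lam` controls how many groups survive: larger values remove more groups, and scaling each group's penalty by the square root of its size is a common convention that keeps large groups from being favoured.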