Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition

Bibliographic Details
Published in:	IEEE/ACM transactions on audio, speech, and language processing, 2022, Vol.30, p.2705-2715
Main Authors:	Bu, Suliang, Zhao, Yunxin, Zhao, Tuo, Wang, Shaojun, Han, Mei
Format:	Article
Language:	English
Subjects:	Artificial neural networks; beamforming; Estimation; Masks; Narrowband; Neural networks; Noise measurement; Parameters; Performance prediction; Probabilistic models; Spectrogram; Speech; Speech enhancement; speech enhancement and recognition; Speech processing; Speech recognition; speech region; Statistical models; Target masking; Time-frequency masks; Training; UNet; Voice recognition

Description
Summary:	Time-frequency (TF) masks are widely used in speech enhancement (SE). However, accurately estimating TF masks from noisy speech remains a challenge for both statistical and neural network (NN) approaches. Statistical model-based mask estimation usually depends on a good parameter initialization, while NN-based methods rely on setting proper and stable learning targets. To address these issues, we propose to extract TF speech structure from clean speech and partition the noisy speech spectrogram into mutually exclusive regions. We investigate modeling clean speech by utterance-specific narrowband complex Gaussian mixture models to derive the regions, and using the region targets to supervise the training of UNet++, a high-performance NN, for predicting regions from noisy speech. For multichannel SE, we consider two scenarios of using speech regions: 1) integrating the regions with TF masks by constraining the mask values or the model parameter updates, and 2) using the predicted regions in place of TF masks. For single-channel SE, we consider using the region targets to improve TF mask targets. Furthermore, we propose to use UNet++ for TF mask estimation. Our experimental results on speech recognition (CHiME-3) and SE (CHiME-3 and LibriSpeech) demonstrate the effectiveness of our proposed approach of modeling speech region structure to improve TF masks for speech recognition and enhancement.
ISSN:	2329-9290, 2329-9304
DOI:	10.1109/TASLP.2022.3196168