Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation
Main Authors: | , , , , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective function. For the objective we propose to use a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To show the effectiveness, we demonstrate the performance on LibriSpeech based reverberant mixtures. On this task, the proposed system approaches the error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test clean, with a difference of only 1.2 percentage points, thus outperforming a conventional permutation invariant training based system and alternative objectives like Scale Invariant Signal-to-Distortion Ratio by a large margin. |
ISSN: | 2379-190X |
DOI: | 10.1109/ICASSP39728.2021.9414661 |