
Clustering of spatial cues by semantic segmentation for anechoic binaural source separation

Bibliographic Details
Published in: Applied Acoustics 2021-01, Vol. 171, p. 107566, Article 107566
Main Authors: Gul, Sania; Sheryar Fulaly, Muhammad; Salman Khan, Muhammad; Waqar Shah, Syed
Format: Article
Language: English
Description
Summary: The recent introduction of neural networks to speech separation has dramatically improved separation performance. This paper presents a novel psychoacoustic approach to speech source separation in anechoic conditions, using semantic segmentation of the interaural spectrograms of the audio mixtures. Two separate U-Nets (a neural network architecture specialized for semantic segmentation) are trained, one on the interaural level difference (ILD) spectrogram and one on the interaural phase difference (IPD) spectrogram of a single source. After training, these U-Nets predict the class of each time-frequency (TF) unit of the interaural spectrogram of the audio mixture. The ILD and IPD soft masks obtained from the U-Nets are combined by a novel scheme that exploits the strength of the interaural cues in different frequency bands. The results show improved separation over two state-of-the-art machine learning source separation systems that use the same interaural cues: an average improvement of 7.32 dB in signal-to-distortion ratio (SDR) and 0.3 points in short-term objective intelligibility (STOI) over the degenerate unmixing estimation technique (DUET) algorithm, and a 2.51 dB improvement in SDR with comparable intelligibility over the model-based expectation–maximization source separation and localization (MESSL) algorithm.
ISSN: 0003-682X, 1872-910X
DOI: 10.1016/j.apacoust.2020.107566
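
The interaural cues named in the summary can be illustrated with a short sketch. The Python code below is not the authors' implementation: it computes ILD and IPD spectrograms from a two-channel signal using their standard definitions, then merges two soft masks with a simple frequency crossover. The function names (stft, interaural_cues, combine_masks), the STFT settings, and the 1500 Hz crossover are all assumptions made for illustration. The record does not describe the paper's actual combination scheme, so the hard crossover here only reflects the general observation that phase cues tend to be more reliable at low frequencies and level cues at high frequencies.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Minimal Hann-windowed STFT; returns an array of shape (freq, time)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)

def interaural_cues(left, right, eps=1e-8):
    """Return (ILD in dB, IPD in radians) for each time-frequency unit."""
    L = stft(left)
    R = stft(right)
    # Level difference: ratio of left and right magnitudes, in decibels.
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    # Phase difference: angle of the cross-spectrum, wrapped to (-pi, pi].
    ipd = np.angle(L * np.conj(R))
    return ild, ipd

def combine_masks(ild_mask, ipd_mask, freqs, crossover_hz=1500.0):
    """Hypothetical combination (an assumption, not the paper's scheme):
    take the IPD mask below the crossover and the ILD mask at or above it."""
    use_ipd = (freqs < crossover_hz)[:, None]
    return np.where(use_ipd, ipd_mask, ild_mask)

# Toy example: the right channel is the left channel attenuated by half.
rng = np.random.default_rng(0)
left = rng.standard_normal(16000)
right = 0.5 * left
ild, ipd = interaural_cues(left, right)

# Stand-in soft masks; in the paper these would come from the two U-Nets.
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)
ild_mask = np.ones_like(ild)
ipd_mask = np.zeros_like(ipd)
combined = combine_masks(ild_mask, ipd_mask, freqs)
```

In this toy case the ILD is about 6 dB in every time-frequency unit and the IPD is zero, because the two channels differ only by a gain of one half.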