
Unsupervised speech separation by detecting speaker changeover points under single channel condition

Bibliographic Details
Published in: International journal of speech technology 2021-12, Vol.24 (4), p.1101-1112
Main Authors: Prasanna Kumar, M. K., Kumaraswamy, R.
Format: Article
Language: English
Description
Summary: In this paper, we propose a method to separate two speakers from a single-channel speech mixture in an unsupervised way by detecting the speaker changeover points. We consider male–female, male–male and female–female speech mixtures, with samples taken from the TIMIT database. The speech mixture is segmented into frames of 20 ms duration. Speech features such as pitch, short-time energy (STE), Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), log area ratio (LAR), reflection coefficients (RC), log filter bank energy (Log FBE) and the fast Fourier transform (FFT) spectrum are computed for each frame. For male–female mixtures, speaker changeover points can be obtained by drawing a mean pitch value line over the pitch contour. This gives good values of signal-to-interference ratio (SIR), as the difference between male and female pitch values is large. For male–male and female–female mixtures, the relation between successive speech frames is obtained by computing the correlation coefficient between the feature vectors (listed above) of successive frames. The same relation can also be obtained by taking the Euclidean distance between the feature vectors of successive speech frames. In these cases, speaker changeover points are identified by plotting the correlation coefficient or the Euclidean distance against frame number and locating local minima or maxima, respectively. Once the speech segments belonging to each speaker are identified using the speaker changeover points, mask functions for the individual speakers are estimated using the time–frequency ratio (TFR) of the mixed speech signal and the recovered speech segments of the individual speakers. This further improves the separation accuracy, and the proposed method gives promising results in terms of SIR, signal-to-artifact ratio (SAR), signal-to-distortion ratio (SDR), short-time objective intelligibility measure and normalized sub-band envelope correlation.
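
The changeover detection described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the log-magnitude FFT spectrum is an assumed stand-in for the full feature set, and the record does not state any smoothing or thresholding the paper may apply, so none is included here.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=20.0):
    """Split a 1-D signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(round(sr * frame_ms / 1000.0))
    n_frames = len(x) // n
    return np.asarray(x[: n_frames * n], dtype=float).reshape(n_frames, n)

def log_spectrum_features(frames):
    """Assumed feature: log-magnitude FFT spectrum of each windowed frame."""
    window = np.hanning(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(frames * window, axis=1)) + 1e-10)

def successive_correlation(features):
    """Pearson correlation between the feature vectors of frames i and i+1."""
    a = features[:-1] - features[:-1].mean(axis=1, keepdims=True)
    b = features[1:] - features[1:].mean(axis=1, keepdims=True)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return (a * b).sum(axis=1) / denom

def successive_distance(features):
    """Euclidean distance between the feature vectors of frames i and i+1."""
    return np.linalg.norm(features[1:] - features[:-1], axis=1)

def local_minima(v):
    """Indices of strict local minima of a 1-D sequence (endpoints excluded)."""
    v = np.asarray(v, dtype=float)
    return np.where((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:]))[0] + 1

def local_maxima(v):
    """Indices of strict local maxima of a 1-D sequence (endpoints excluded)."""
    return local_minima(-np.asarray(v, dtype=float))

def changeovers_by_correlation(features):
    """Candidate changeover frames: local minima of successive-frame correlation.

    Entry i of the correlation compares frames i and i+1, so a minimum at i
    marks frame i+1 as the first frame of a new segment.
    """
    return local_minima(successive_correlation(features)) + 1

def changeovers_by_distance(features):
    """Candidate changeover frames: local maxima of successive-frame distance."""
    return local_maxima(successive_distance(features)) + 1

def changeovers_by_mean_pitch(pitch):
    """Frames where the voiced pitch contour crosses its mean value line.

    Frames with pitch <= 0 are treated as unvoiced and skipped (an assumption).
    Returns the changeover frame indices and the mean pitch.
    """
    pitch = np.asarray(pitch, dtype=float)
    voiced = np.flatnonzero(pitch > 0)
    mean_pitch = pitch[voiced].mean()
    above = pitch[voiced] > mean_pitch
    flips = np.flatnonzero(above[1:] != above[:-1]) + 1
    return voiced[flips], mean_pitch
```

In use, one would call `frame_signal` on the mixture, build one feature vector per frame, and pass the resulting matrix to `changeovers_by_correlation` or `changeovers_by_distance`. On real speech every local extremum is only a candidate, so some form of smoothing or minimum segment length would likely be needed in practice.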
ISSN: 1381-2416, 1572-8110
DOI: 10.1007/s10772-021-09875-3