Enhanced speech separation through a supervised approach using bidirectional long short-term memory in dual domains
Published in: | Computers & electrical engineering 2024-08, Vol.118, p.109364, Article 109364 |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | The process of separating individual sound sources from mono audio is a complex yet essential endeavor in audio signal processing and analysis. This article presents an algorithm tailored for bidirectional transformations aimed at effectively isolating speech from single-channel audio. Leveraging the dual-tree complex wavelet transform (DTCWT) on time-domain signals circumvents limitations inherent in the discrete wavelet transform (DWT), such as its incapacity to manage substantial shifts and inability to discern the correct direction. In this process, a series of subband signals is generated and subjected to the short-time Fourier transform (STFT) to create a complex spectrogram, which is then transformed into its absolute value and input into the Bi-directional Long Short-Term Memory (Bi-LSTM) network with a specified number of layers and units. This network utilizes the bidirectional capabilities of LSTM units to understand both the preceding and subsequent contexts of the input data, enabling the identification of specific speech components, aided by the ideal soft mask components that serve as corresponding labels. The final predicted signal is obtained by element-wise multiplication of the complex spectrogram by the estimated mask produced by the model. Subsequently, the inverse STFT is applied with parameters consistent with the initial transform, followed by the inverse DTCWT on the refined source elements with the same decomposition levels and wavelet filters. The improved efficacy of the proposed method for source separation quality was validated through experimental assessments conducted on the GRID audio–visual and TIMIT databases, considering metrics such as SDR, SIR, SAR, SNR, PESQ, and STOI. |
---|---|
ISSN: | 0045-7906 |
DOI: | 10.1016/j.compeleceng.2024.109364 |
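
The masking step described in the summary can be illustrated with a small sketch. It is not the article's implementation: the function names, the toy values, and the ratio-style mask formula |S| / (|S| + |N|) are assumptions chosen for illustration, since the summary does not give the exact definition of its ideal soft mask. The DTCWT, STFT, and Bi-LSTM stages are omitted. Here the mask is computed from known sources, which corresponds to the training label, whereas in the described method the network would estimate it.

```python
# Hypothetical sketch: names and the mask formula are illustrative assumptions,
# not taken from the article.

def ideal_soft_mask(speech_mag, interferer_mag, eps=1e-8):
    """Ratio-style soft mask in [0, 1] for each time-frequency bin."""
    return [
        [s / (s + n + eps) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_mag, interferer_mag)
    ]


def apply_mask(mixture_spec, mask):
    """Element-wise product of a complex spectrogram and a real-valued mask."""
    return [
        [m * x for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(mixture_spec, mask)
    ]


# Toy 2x2 "spectrograms" (frequency bins x frames) holding complex values.
speech = [[3 + 4j, 0j], [1 + 0j, 0 + 2j]]
interferer = [[0j, 5 + 0j], [1 + 0j, 0 + 2j]]
mixture = [
    [s + n for s, n in zip(s_row, n_row)]
    for s_row, n_row in zip(speech, interferer)
]

# Magnitudes play the role of the absolute-value spectrogram fed to the network.
speech_mag = [[abs(v) for v in row] for row in speech]
interferer_mag = [[abs(v) for v in row] for row in interferer]

# The ideal mask acts as the label; a trained model would predict it instead.
mask = ideal_soft_mask(speech_mag, interferer_mag)

# Masked complex spectrogram, which would then go through the inverse STFT
# and the inverse DTCWT to return to the time domain.
estimate = apply_mask(mixture, mask)
```

Because the mask is real-valued, the estimate keeps the phase of the mixture in every bin. In this toy case the overlapping bins happen to be in phase, so the masked mixture matches the speech spectrogram closely. That does not hold in general.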