Enhanced speech separation through a supervised approach using bidirectional long short-term memory in dual domains

Bibliographic Details
Published in: Computers & electrical engineering 2024-08, Vol.118, p.109364, Article 109364
Main Authors: Basir, Samiul, Hosen, Md Shakhawat, Hossain, Md Nahid, Aktaruzzaman, Md, Ali, Md Sadek, Islam, Md Shohidul
Format: Article
Language: English
Description
Summary: The process of separating individual sound sources from mono audio is a complex yet essential endeavor in audio signal processing and analysis. This article presents an algorithm tailored for bidirectional transformations aimed at effectively isolating speech from single-channel audio. Leveraging the dual-tree complex wavelet transform (DTCWT) on time-domain signals circumvents limitations inherent in the discrete wavelet transform (DWT), such as its incapacity to manage substantial shifts and inability to discern the correct direction. In this process, a series of subband signals is generated and subjected to the short-time Fourier transform (STFT) to create a complex spectrogram, which is then transformed into its absolute value and input into the Bi-directional Long Short-Term Memory (Bi-LSTM) network with a specified number of layers and units. This network utilizes the bidirectional capabilities of LSTM units to understand both the preceding and subsequent contexts of the input data, enabling the identification of specific speech components, aided by the ideal soft mask components that serve as corresponding labels. The final predicted signal is obtained by element-wise multiplication of the complex spectrogram by the estimated mask produced by the model. Subsequently, the inverse STFT is applied with parameters consistent with the initial transform, followed by the inverse DTCWT on the refined source elements with the same decomposition levels and wavelet filters. The improved efficacy of the proposed method for source separation quality was validated through experimental assessments conducted on the GRID audio–visual and TIMIT databases, considering metrics such as SDR, SIR, SAR, SNR, PESQ, and STOI.
ISSN: 0045-7906
DOI: 10.1016/j.compeleceng.2024.109364
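
The masking stage described in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the DTCWT decomposition and the Bi-LSTM are left out, and an oracle mask computed from the known sources stands in for the network's estimate. The ratio form used for the "ideal soft mask", the helper names (`stft`, `istft`, `ideal_soft_mask`), the window, frame size, hop, and the two sine tones standing in for speech and interference are all assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by weighted overlap-add, with the same n_fft and hop as stft()."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + n_fft)
        out[sl] += frames[i] * window
        norm[sl] += window ** 2
    # Guard against division by zero where the window is zero (signal edges).
    return out / np.maximum(norm, 1e-8)

def ideal_soft_mask(target_mag, interferer_mag, eps=1e-8):
    """Assumed ratio-style soft mask in [0, 1]; the paper's exact definition may differ."""
    return target_mag / (target_mag + interferer_mag + eps)

# Hypothetical stand-ins for a speech source and an interfering source.
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interferer = 0.8 * np.sin(2 * np.pi * 2000 * t)
mixture = target + interferer

# Magnitude spectrograms (absolute value of the complex STFT) give the mask labels.
mask = ideal_soft_mask(np.abs(stft(target)), np.abs(stft(interferer)))

# Element-wise multiplication of the complex mixture spectrogram by the mask,
# followed by the inverse STFT with parameters matching the forward transform.
estimate = istft(mask * stft(mixture))
```

In a full pipeline along the lines of the summary, the same masking would be applied per DTCWT subband, with the mask predicted by the trained network rather than computed from the clean sources, and the inverse DTCWT would follow the inverse STFT.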