Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality
Format: Conference Proceeding
Language: English
Summary: Mainstream speech enhancement (SE) algorithms often rely on deep neural network architectures trained on large amounts of data and high-dimensional feature representations. In the successful SE framework DPTNet, the waveform- and short-time-Fourier-transform (STFT)-domain features, together with their bi-projection fusion features, form the encoder output used to predict an accurate mask for the input spectrogram and obtain the enhanced signal. This study investigates whether the input speech features of DPTNet can be reduced in size to alleviate its computational complexity while preserving its SE performance. The initial attempt is to use either the real or the imaginary part of the STFT features instead of both. Preliminary experiments on the VoiceBank-DEMAND task show that this modification makes an insignificant difference in SE metric scores, including PESQ and STOI, on the test set. These results suggest that the real or imaginary part of the STFT features alone suffices to work together with the waveform-domain features in DPTNet. In this way, DPTNet can deliver the same high SE performance at a lower computational cost and thus be implemented more efficiently.
ISSN: 2768-4156
DOI: 10.1109/ICASI55125.2022.9774439
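The dimensionality reduction described in the summary — keeping only the real or the imaginary part of the STFT features rather than both — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the window type, FFT size, and hop length are assumptions, and DPTNet's actual encoder also fuses waveform-domain features, which are omitted here.

```python
import numpy as np

def stft_features(signal, n_fft=512, hop=128, part="both"):
    """STFT-domain features for a mono signal (hypothetical sketch).

    part="both" stacks real and imaginary parts, as in the original
    DPTNet encoder input; part="real" or part="imag" keeps one part,
    halving the STFT feature dimensionality as the paper proposes.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=-1)  # (n_frames, n_fft // 2 + 1), complex
    if part == "real":
        return spec.real
    if part == "imag":
        return spec.imag
    return np.concatenate([spec.real, spec.imag], axis=-1)

x = np.random.randn(16000).astype(np.float32)  # 1 s of audio at 16 kHz
full = stft_features(x, part="both")           # 2 * (n_fft // 2 + 1) = 514 dims
real_only = stft_features(x, part="real")      # n_fft // 2 + 1 = 257 dims
assert real_only.shape[-1] * 2 == full.shape[-1]
```

The per-frame feature width feeding the downstream network is halved, which is the source of the computational savings the study reports, at the cost of discarding half of the complex spectrogram.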