Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality
Main Authors: | Tang, Yong-Jie; Hsieh, Po-Yen; Tsai, Ming-Hung; Chen, Yan-Tong; Hung, Jeih-Weih |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Deep learning; dual-path Transformer Network; loss function; Measurement; Neural networks; PESQ; Speech enhancement; STFT; STOI; Technological innovation; Training data; Transformers |
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 83 |
container_issue | |
container_start_page | 80 |
container_title | |
container_volume | |
creator | Tang, Yong-Jie; Hsieh, Po-Yen; Tsai, Ming-Hung; Chen, Yan-Tong; Hung, Jeih-Weih |
description | Mainstream speech enhancement (SE) algorithms often require a deep neural network architecture that is learned from a great amount of training data and high-dimensional feature representations. In the successful SE framework DPTNet, the waveform- and short-time-Fourier-transform (STFT)-domain features and their bi-projection fusion features are used together as the encoder output to predict an accurate mask for the input spectrogram and obtain the enhanced signal. This study investigates whether we can reduce the size of the input speech features in DPTNet to alleviate its computational complexity while keeping its SE performance. The initial attempt is to use either the real or the imaginary part of the STFT features instead of both. Preliminary experiments conducted on the VoiceBank-DEMAND task show that this modification brings an insignificant difference in SE metric scores, including PESQ and STOI, on the test dataset. These results probably indicate that the real or imaginary part of the STFT features alone suffices to work together with wave-domain features in DPTNet. In this way, DPTNet can exhibit the same high SE performance with a lower computation need, and thus we can implement it more efficiently. (A minimal illustrative sketch of this real/imaginary-part reduction follows the record table below.) |
doi_str_mv | 10.1109/ICASI55125.2022.9774439 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2768-4156 |
ispartof | 2022 8th International Conference on Applied System Innovation (ICASI), 2022, p.80-83 |
issn | 2768-4156 |
language | eng |
recordid | cdi_ieee_primary_9774439 |
source | IEEE Xplore All Conference Series |
subjects | Deep learning; dual-path Transformer Network; loss function; Measurement; Neural networks; PESQ; Speech enhancement; STFT; STOI; Technological innovation; Training data; Transformers |
title | Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality |
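The abstract amounts to a simple change at the encoder input: feed the STFT branch only the real (or only the imaginary) part of the spectrogram instead of both, halving that branch's channel count. The sketch below illustrates the idea in PyTorch; the tensor names, window, FFT size, and hop length are illustrative assumptions, not values taken from the authors' DPTNet implementation.

```python
# Illustrative sketch (not the authors' DPTNet code): halving the
# STFT-domain feature dimensionality by keeping only the real part.
# n_fft and hop are placeholder values chosen for the example.
import torch

n_fft, hop = 512, 256
wave = torch.randn(1, 16000)           # 1 s of dummy 16 kHz audio
window = torch.hann_window(n_fft)

spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                  window=window, return_complex=True)

# Baseline STFT-branch input: real and imaginary parts stacked,
# giving 2 * (n_fft // 2 + 1) feature channels per frame.
full_feat = torch.cat([spec.real, spec.imag], dim=1)

# Reduced input per the paper's idea: only the real part
# (or, symmetrically, only spec.imag), so half the channels.
real_feat = spec.real

print(full_feat.shape)                 # torch.Size([1, 514, 63])
print(real_feat.shape)                 # torch.Size([1, 257, 63])
```

Under these assumptions, the STFT branch's per-frame input shrinks from 514 channels to 257; this is the kind of dimensionality reduction the abstract reports as costing an insignificant amount of PESQ and STOI on the VoiceBank-DEMAND test set.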