
Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality

Bibliographic Details
Main Authors: Tang, Yong-Jie, Hsieh, Po-Yen, Tsai, Ming-Hung, Chen, Yan-Tong, Hung, Jeih-Weih
Format: Conference Proceeding
Language:English
Subjects:
cited_by
cites
container_end_page 83
container_issue
container_start_page 80
container_title
container_volume
creator Tang, Yong-Jie
Hsieh, Po-Yen
Tsai, Ming-Hung
Chen, Yan-Tong
Hung, Jeih-Weih
description The mainstream speech enhancement (SE) algorithms often require a deep neural network architecture, which is learned from a large amount of training data together with high-dimensional feature representations. In the successful SE framework DPTNet, the waveform- and short-time-Fourier-transform (STFT)-domain features and their bi-projection fusion features are used together as the encoder output to predict an accurate mask for the input spectrogram and obtain the enhanced signal. This study investigates whether we can reduce the size of the input speech features in DPTNet to alleviate its computational complexity while keeping its SE performance. The initial attempt is to use either the real or the imaginary part of the STFT features instead of both parts. Preliminary experiments conducted on the VoiceBank-DEMAND task show that this modification brings an insignificant difference in SE metric scores, including PESQ and STOI, on the test dataset. These results probably indicate that the real or imaginary part of the STFT features alone suffices to work together with the waveform-domain features in DPTNet. In this way, DPTNet can exhibit the same high SE performance at a lower computational cost, and thus we can implement it more efficiently.
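
A minimal sketch, assuming a PyTorch-based feature front end (the function name stft_features and all parameter values here are illustrative assumptions, not taken from the paper): it shows how keeping only the real part of the STFT halves the frequency-domain channel count that would otherwise come from stacking the real and imaginary parts.

```python
import torch

def stft_features(wav, n_fft=512, hop=128, real_only=True):
    """Return STFT-domain features for an SE encoder.

    real_only=True keeps just the real part (n_fft//2 + 1 channels);
    otherwise the real and imaginary parts are stacked (twice as many channels).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # (batch, F, T), complex
    if real_only:
        return spec.real                                     # (batch, F, T)
    return torch.cat([spec.real, spec.imag], dim=1)          # (batch, 2F, T)

wav = torch.randn(1, 16000)                   # dummy 1-second, 16 kHz waveform
reduced = stft_features(wav, real_only=True)
full = stft_features(wav, real_only=False)
print(reduced.shape, full.shape)              # (1, 257, T) vs (1, 514, T)
```

Under this assumption, the encoder branch that consumes the STFT features sees half as many input channels, which is the source of the computational saving the abstract describes.
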
doi_str_mv 10.1109/ICASI55125.2022.9774439
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2768-4156
ispartof 2022 8th International Conference on Applied System Innovation (ICASI), 2022, p.80-83
issn 2768-4156
language eng
recordid cdi_ieee_primary_9774439
source IEEE Xplore All Conference Series
subjects Deep learning
dual-path Transformer Network
loss function
Measurement
Neural networks
PESQ
Speech enhancement
STFT
STOI
Technological innovation
Training data
Transformers
title Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality