
Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality

Bibliographic Details
Main Authors: Tang, Yong-Jie, Hsieh, Po-Yen, Tsai, Ming-Hung, Chen, Yan-Tong, Hung, Jeih-Weih
Format: Conference Proceeding
Language:English
Subjects:
cited_by
cites
container_end_page 83
container_issue
container_start_page 80
container_title
container_volume
creator Tang, Yong-Jie
Hsieh, Po-Yen
Tsai, Ming-Hung
Chen, Yan-Tong
Hung, Jeih-Weih
description The mainstream speech enhancement (SE) algorithms often require a deep neural network architecture, which is learned from a large amount of training data together with high-dimensional feature representations. In the successful SE framework DPTNet, the waveform- and short-time-Fourier-transform (STFT)-domain features and their bi-projection fusion features are used together as the encoder output to predict an accurate mask for the input spectrogram and obtain the enhanced signal. This study investigates whether we can reduce the size of the input speech features in DPTNet to alleviate its computational complexity while keeping its SE performance. The initial attempt is to use either the real or the imaginary part of the STFT features instead of both parts. Preliminary experiments conducted on the VoiceBank-DEMAND task show that this modification brings an insignificant difference in SE metric scores, including PESQ and STOI, on the test dataset. These results probably indicate that the real or imaginary part of the STFT features alone suffices to work together with the waveform-domain features in DPTNet. In this way, DPTNet can exhibit the same high SE performance at a lower computational cost, and thus we can implement it more efficiently.
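
A minimal sketch, assuming a PyTorch-based feature front end (the function name stft_features and all parameter values here are illustrative assumptions, not taken from the paper): it shows how keeping only the real part of the STFT halves the frequency-domain channel count that would otherwise come from stacking the real and imaginary parts.

```python
import torch

def stft_features(wav, n_fft=512, hop=128, real_only=True):
    """Return STFT-domain features for an SE encoder.

    real_only=True keeps just the real part (n_fft//2 + 1 channels);
    otherwise the real and imaginary parts are stacked (twice as many channels).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # (batch, F, T), complex
    if real_only:
        return spec.real                                     # (batch, F, T)
    return torch.cat([spec.real, spec.imag], dim=1)          # (batch, 2F, T)

wav = torch.randn(1, 16000)                   # dummy 1-second, 16 kHz waveform
reduced = stft_features(wav, real_only=True)
full = stft_features(wav, real_only=False)
print(reduced.shape, full.shape)              # (1, 257, T) vs (1, 514, T)
```

Under this assumption, the encoder branch that consumes the STFT features sees half as many input channels, which is the source of the computational saving the abstract describes.
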
doi_str_mv 10.1109/ICASI55125.2022.9774439
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2768-4156
ispartof 2022 8th International Conference on Applied System Innovation (ICASI), 2022, p.80-83
issn 2768-4156
language eng
recordid cdi_ieee_primary_9774439
source IEEE Xplore All Conference Series
subjects Deep learning
dual-path Transformer Network
loss function
Measurement
Neural networks
PESQ
Speech enhancement
STFT
STOI
Technological innovation
Training data
Transformers
title Improving the efficiency of Dual-path Transformer Network for speech enhancement by reducing the input feature dimensionality