A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Bibliographic Details
Main Authors:	Ma, Feng; Tu, Yanhui; He, Maokui; Wang, Ruoyu; Niu, Shutong; Sun, Lei; Ye, Zhongfu; Du, Jun; Pan, Jia; Lee, Chin-Hui
Format:	Conference Proceeding
Language:	English
Subjects:	Acoustics; CHiME-7 Challenge; Estimation; iterative mask estimation; Iterative methods; multi-channel speech enhancement; Robustness; Speaker diarization; Speech enhancement; Speech recognition; Topology

cited_by
cites
container_end_page	12335
container_issue
container_start_page	12331
container_title
container_volume
creator	Ma, Feng; Tu, Yanhui; He, Maokui; Wang, Ruoyu; Niu, Shutong; Sun, Lei; Ye, Zhongfu; Du, Jun; Pan, Jia; Lee, Chin-Hui
description	Deep learning (DL)-based speaker diarization methods have shown strong performance compared with traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenes. However, most DL-based approaches cannot exploit spatial information well because they are not robust to unknown array topologies and acoustic scenarios. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. First, the complex angular central Gaussian mixture model (cACGMM), initialized with diarization results, is used to estimate the presence probability of each speaker at each time-frequency bin, namely the speaker masks, over a long-term chunk. Then, the speaker masks are converted to speaker activities by thresholding, which yields the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the three CHiME-7 datasets (CHiME-6, DiPCo, Mixer 6) show that the proposed method can improve diarization and recognition performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of the CHiME-7 DASR Challenge.
doi_str_mv	10.1109/ICASSP48485.2024.10446168
format	conference_proceeding
fullrecord	<record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10446168</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10446168</ieee_id><sourcerecordid>10446168</sourcerecordid><originalsourceid>FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3</originalsourceid><addsrcrecordid>eNo1kMFOwzAMhgMSEmPsDTiEB2hJmrRxjtMYY1InEO2B25S27hbWtVVakODpyTS42PL327_0m5B7zkLOmX5YL-ZZ9ipBQhxGLJIhZ1ImPIELMtNKg4iZkF7kl2QSCaUDrtn7NbkZhg_GGCgJEzLMadab0ZqGpl27C3J0R7oe0Xn2hXRjhgNdDqM9-rlr6bzvXWfKPa07RzefzWiDxd60LTbeBs0BHX20xtmf87ppqxNHf_CGZbdr7QnfkqvaNAPO_vqU5E_LfPEcpC8rnykNrIogiI0ygBKESKTUSQEAyGLJSoyjuOC8qISERPmqa5OIkimudVWp0qOSYyGm5O5saxFx2zufwX1v_38kfgG_N10M</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><source>IEEE Xplore All Conference Series</source><creator>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</creator><creatorcontrib>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</creatorcontrib><description>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. 
First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</description><identifier>EISSN: 2379-190X</identifier><identifier>EISBN: 9798350344851</identifier><identifier>DOI: 10.1109/ICASSP48485.2024.10446168</identifier><language>eng</language><publisher>IEEE</publisher><subject>Acoustics ; CHiME-7 Challenge ; Estimation ; iterative mask estimation ; Iterative methods ; multi-channel speech enhancement ; Robustness ; Speaker diarization ; Speech enhancement ; Speech recognition ; Topology</subject><ispartof>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, 
p.12331-12335</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10446168$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,27902,54530,54907</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10446168$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Ma, Feng</creatorcontrib><creatorcontrib>Tu, Yanhui</creatorcontrib><creatorcontrib>He, Maokui</creatorcontrib><creatorcontrib>Wang, Ruoyu</creatorcontrib><creatorcontrib>Niu, Shutong</creatorcontrib><creatorcontrib>Sun, Lei</creatorcontrib><creatorcontrib>Ye, Zhongfu</creatorcontrib><creatorcontrib>Du, Jun</creatorcontrib><creatorcontrib>Pan, Jia</creatorcontrib><creatorcontrib>Lee, Chin-Hui</creatorcontrib><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><title>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title><addtitle>ICASSP</addtitle><description>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. 
First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</description><subject>Acoustics</subject><subject>CHiME-7 Challenge</subject><subject>Estimation</subject><subject>iterative mask estimation</subject><subject>Iterative methods</subject><subject>multi-channel speech enhancement</subject><subject>Robustness</subject><subject>Speaker diarization</subject><subject>Speech enhancement</subject><subject>Speech recognition</subject><subject>Topology</subject><issn>2379-190X</issn><isbn>9798350344851</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1kMFOwzAMhgMSEmPsDTiEB2hJmrRxjtMYY1InEO2B25S27hbWtVVakODpyTS42PL327_0m5B7zkLOmX5YL-ZZ9ipBQhxGLJIhZ1ImPIELMtNKg4iZkF7kl2QSCaUDrtn7NbkZhg_GGCgJEzLMadab0ZqGpl27C3J0R7oe0Xn2hXRjhgNdDqM9-rlr6bzvXWfKPa07RzefzWiDxd60LTbeBs0BHX20xtmf87ppqxNHf_CGZbdr7QnfkqvaNAPO_vqU5E_LfPEcpC8rnykNrIogiI0ygBKESKTUSQEAyGLJSoyjuOC8qISERPmqa5OIkimudVWp0qOSYyGm5O5saxFx2zufwX1v_38kfgG_N10M</recordid><startdate>20240414</startdate><enddate>20240414</enddate><creator>Ma, Feng</creator><creator>Tu, Yanhui</creator><creator>He, 
Maokui</creator><creator>Wang, Ruoyu</creator><creator>Niu, Shutong</creator><creator>Sun, Lei</creator><creator>Ye, Zhongfu</creator><creator>Du, Jun</creator><creator>Pan, Jia</creator><creator>Lee, Chin-Hui</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>20240414</creationdate><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><author>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Acoustics</topic><topic>CHiME-7 Challenge</topic><topic>Estimation</topic><topic>iterative mask estimation</topic><topic>Iterative methods</topic><topic>multi-channel speech enhancement</topic><topic>Robustness</topic><topic>Speaker diarization</topic><topic>Speech enhancement</topic><topic>Speech recognition</topic><topic>Topology</topic><toplevel>online_resources</toplevel><creatorcontrib>Ma, Feng</creatorcontrib><creatorcontrib>Tu, Yanhui</creatorcontrib><creatorcontrib>He, Maokui</creatorcontrib><creatorcontrib>Wang, Ruoyu</creatorcontrib><creatorcontrib>Niu, Shutong</creatorcontrib><creatorcontrib>Sun, Lei</creatorcontrib><creatorcontrib>Ye, Zhongfu</creatorcontrib><creatorcontrib>Du, Jun</creatorcontrib><creatorcontrib>Pan, Jia</creatorcontrib><creatorcontrib>Lee, Chin-Hui</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference 
Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ma, Feng</au><au>Tu, Yanhui</au><au>He, Maokui</au><au>Wang, Ruoyu</au><au>Niu, Shutong</au><au>Sun, Lei</au><au>Ye, Zhongfu</au><au>Du, Jun</au><au>Pan, Jia</au><au>Lee, Chin-Hui</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</atitle><btitle>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</btitle><stitle>ICASSP</stitle><date>2024-04-14</date><risdate>2024</risdate><spage>12331</spage><epage>12335</epage><pages>12331-12335</pages><eissn>2379-190X</eissn><eisbn>9798350344851</eisbn><abstract>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. 
Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</abstract><pub>IEEE</pub><doi>10.1109/ICASSP48485.2024.10446168</doi><tpages>5</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	EISSN: 2379-190X
ispartof	ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.12331-12335
issn	2379-190X
language	eng
recordid	cdi_ieee_primary_10446168
source	IEEE Xplore All Conference Series
subjects	Acoustics; CHiME-7 Challenge; Estimation; iterative mask estimation; Iterative methods; multi-channel speech enhancement; Robustness; Speaker diarization; Speech enhancement; Speech recognition; Topology
title	A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T16%3A08%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Spatial%20Long-Term%20Iterative%20Mask%20Estimation%20Approach%20for%20Multi-Channel%20Speaker%20Diarization%20and%20Speech%20Recognition&rft.btitle=ICASSP%202024%20-%202024%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Ma,%20Feng&rft.date=2024-04-14&rft.spage=12331&rft.epage=12335&rft.pages=12331-12335&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP48485.2024.10446168&rft.eisbn=9798350344851&rft_dat=%3Cieee_CHZPO%3E10446168%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10446168&rfr_iscdi=true