
A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Deep learning (DL)-based speaker diarization methods have shown strong performance compared to traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poo...

Bibliographic Details
Main Authors: Ma, Feng, Tu, Yanhui, He, Maokui, Wang, Ruoyu, Niu, Shutong, Sun, Lei, Ye, Zhongfu, Du, Jun, Pan, Jia, Lee, Chin-Hui
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 12335
container_issue
container_start_page 12331
container_title
container_volume
creator Ma, Feng
Tu, Yanhui
He, Maokui
Wang, Ruoyu
Niu, Shutong
Sun, Lei
Ye, Zhongfu
Du, Jun
Pan, Jia
Lee, Chin-Hui
description Deep learning (DL)-based speaker diarization methods have shown strong performance compared to traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenes. However, most DL-based approaches cannot exploit spatial information well because they are not robust to unknown array topologies and acoustic scenarios. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. First, the complex angular central Gaussian mixture model (cACGMM), initialized with the diarization results, is used to estimate the presence probability of each speaker at each time-frequency bin, namely the speaker masks, over a long-term chunk. Then, the speaker masks are converted to speaker activities according to a threshold, which yields the diarization information of which speaker is active and when. Finally, the estimated speaker activities can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the three CHiME-7 datasets (CHiME-6, DiPCo, Mixer 6) show that the proposed method can improve diarization and recognition performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of the CHiME-7 DASR Challenge.
doi_str_mv 10.1109/ICASSP48485.2024.10446168
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10446168</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10446168</ieee_id><sourcerecordid>10446168</sourcerecordid><originalsourceid>FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3</originalsourceid><addsrcrecordid>eNo1kMFOwzAMhgMSEmPsDTiEB2hJmrRxjtMYY1InEO2B25S27hbWtVVakODpyTS42PL327_0m5B7zkLOmX5YL-ZZ9ipBQhxGLJIhZ1ImPIELMtNKg4iZkF7kl2QSCaUDrtn7NbkZhg_GGCgJEzLMadab0ZqGpl27C3J0R7oe0Xn2hXRjhgNdDqM9-rlr6bzvXWfKPa07RzefzWiDxd60LTbeBs0BHX20xtmf87ppqxNHf_CGZbdr7QnfkqvaNAPO_vqU5E_LfPEcpC8rnykNrIogiI0ygBKESKTUSQEAyGLJSoyjuOC8qISERPmqa5OIkimudVWp0qOSYyGm5O5saxFx2zufwX1v_38kfgG_N10M</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><source>IEEE Xplore All Conference Series</source><creator>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</creator><creatorcontrib>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</creatorcontrib><description>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. 
First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</description><identifier>EISSN: 2379-190X</identifier><identifier>EISBN: 9798350344851</identifier><identifier>DOI: 10.1109/ICASSP48485.2024.10446168</identifier><language>eng</language><publisher>IEEE</publisher><subject>Acoustics ; CHiME-7 Challenge ; Estimation ; iterative mask estimation ; Iterative methods ; multi-channel speech enhancement ; Robustness ; Speaker diarization ; Speech enhancement ; Speech recognition ; Topology</subject><ispartof>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, 
p.12331-12335</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10446168$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,27902,54530,54907</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10446168$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Ma, Feng</creatorcontrib><creatorcontrib>Tu, Yanhui</creatorcontrib><creatorcontrib>He, Maokui</creatorcontrib><creatorcontrib>Wang, Ruoyu</creatorcontrib><creatorcontrib>Niu, Shutong</creatorcontrib><creatorcontrib>Sun, Lei</creatorcontrib><creatorcontrib>Ye, Zhongfu</creatorcontrib><creatorcontrib>Du, Jun</creatorcontrib><creatorcontrib>Pan, Jia</creatorcontrib><creatorcontrib>Lee, Chin-Hui</creatorcontrib><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><title>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title><addtitle>ICASSP</addtitle><description>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. 
First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</description><subject>Acoustics</subject><subject>CHiME-7 Challenge</subject><subject>Estimation</subject><subject>iterative mask estimation</subject><subject>Iterative methods</subject><subject>multi-channel speech enhancement</subject><subject>Robustness</subject><subject>Speaker diarization</subject><subject>Speech enhancement</subject><subject>Speech recognition</subject><subject>Topology</subject><issn>2379-190X</issn><isbn>9798350344851</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1kMFOwzAMhgMSEmPsDTiEB2hJmrRxjtMYY1InEO2B25S27hbWtVVakODpyTS42PL327_0m5B7zkLOmX5YL-ZZ9ipBQhxGLJIhZ1ImPIELMtNKg4iZkF7kl2QSCaUDrtn7NbkZhg_GGCgJEzLMadab0ZqGpl27C3J0R7oe0Xn2hXRjhgNdDqM9-rlr6bzvXWfKPa07RzefzWiDxd60LTbeBs0BHX20xtmf87ppqxNHf_CGZbdr7QnfkqvaNAPO_vqU5E_LfPEcpC8rnykNrIogiI0ygBKESKTUSQEAyGLJSoyjuOC8qISERPmqa5OIkimudVWp0qOSYyGm5O5saxFx2zufwX1v_38kfgG_N10M</recordid><startdate>20240414</startdate><enddate>20240414</enddate><creator>Ma, Feng</creator><creator>Tu, Yanhui</creator><creator>He, 
Maokui</creator><creator>Wang, Ruoyu</creator><creator>Niu, Shutong</creator><creator>Sun, Lei</creator><creator>Ye, Zhongfu</creator><creator>Du, Jun</creator><creator>Pan, Jia</creator><creator>Lee, Chin-Hui</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>20240414</creationdate><title>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</title><author>Ma, Feng ; Tu, Yanhui ; He, Maokui ; Wang, Ruoyu ; Niu, Shutong ; Sun, Lei ; Ye, Zhongfu ; Du, Jun ; Pan, Jia ; Lee, Chin-Hui</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Acoustics</topic><topic>CHiME-7 Challenge</topic><topic>Estimation</topic><topic>iterative mask estimation</topic><topic>Iterative methods</topic><topic>multi-channel speech enhancement</topic><topic>Robustness</topic><topic>Speaker diarization</topic><topic>Speech enhancement</topic><topic>Speech recognition</topic><topic>Topology</topic><toplevel>online_resources</toplevel><creatorcontrib>Ma, Feng</creatorcontrib><creatorcontrib>Tu, Yanhui</creatorcontrib><creatorcontrib>He, Maokui</creatorcontrib><creatorcontrib>Wang, Ruoyu</creatorcontrib><creatorcontrib>Niu, Shutong</creatorcontrib><creatorcontrib>Sun, Lei</creatorcontrib><creatorcontrib>Ye, Zhongfu</creatorcontrib><creatorcontrib>Du, Jun</creatorcontrib><creatorcontrib>Pan, Jia</creatorcontrib><creatorcontrib>Lee, Chin-Hui</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference 
Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ma, Feng</au><au>Tu, Yanhui</au><au>He, Maokui</au><au>Wang, Ruoyu</au><au>Niu, Shutong</au><au>Sun, Lei</au><au>Ye, Zhongfu</au><au>Du, Jun</au><au>Pan, Jia</au><au>Lee, Chin-Hui</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition</atitle><btitle>ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</btitle><stitle>ICASSP</stitle><date>2024-04-14</date><risdate>2024</risdate><spage>12331</spage><epage>12335</epage><pages>12331-12335</pages><eissn>2379-190X</eissn><eisbn>9798350344851</eisbn><abstract>Deep learning (DL)-based speaker diarization methods have proven powerful performance comparing to traditional clustering-based methods for multi-talker speech diarization and recognition in farfield scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poor robustness to unknown array topology and acoustic scenario. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. First, the complex angular central gaussian mixture model (cACGMM) with diarization results as initial values is used to estimate the presence probability of each speaker at each time-frequency bin, namely speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities according to the threshold, which deliver the diarization information of which speaker is active and when. 
Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the CHiME-7 three datasets (CHiME-6, DiPCo, Mixer 6) show proposed method can improve diarization and recognition systems performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of CHiME-7 DASR Challenge.</abstract><pub>IEEE</pub><doi>10.1109/ICASSP48485.2024.10446168</doi><tpages>5</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier EISSN: 2379-190X
ispartof ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.12331-12335
issn 2379-190X
language eng
recordid cdi_ieee_primary_10446168
source IEEE Xplore All Conference Series
subjects Acoustics
CHiME-7 Challenge
Estimation
iterative mask estimation
Iterative methods
multi-channel speech enhancement
Robustness
Speaker diarization
Speech enhancement
Speech recognition
Topology
title A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T16%3A08%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Spatial%20Long-Term%20Iterative%20Mask%20Estimation%20Approach%20for%20Multi-Channel%20Speaker%20Diarization%20and%20Speech%20Recognition&rft.btitle=ICASSP%202024%20-%202024%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Ma,%20Feng&rft.date=2024-04-14&rft.spage=12331&rft.epage=12335&rft.pages=12331-12335&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP48485.2024.10446168&rft.eisbn=9798350344851&rft_dat=%3Cieee_CHZPO%3E10446168%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10446168&rfr_iscdi=true
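
The description field above says that speaker masks are converted to speaker activities according to a threshold. As a rough illustration only, that step might look like the sketch below. The record does not give the threshold value or say how a time-frequency mask is reduced to a per-frame decision, so the mean over frequency bins, the default of 0.5, and the function names masks_to_activity and activity_to_segments are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple

# One speaker's mask: a list of frames, each a list of per-frequency-bin
# presence probabilities in [0, 1]. (Layout assumed for illustration.)
Mask = List[List[float]]


def masks_to_activity(mask: Mask, threshold: float = 0.5) -> List[bool]:
    """Flag a frame as active when its mean mask value exceeds the threshold.

    Averaging over frequency and the 0.5 default are assumptions; the
    record only states that a threshold is applied.
    """
    return [sum(frame) / len(frame) > threshold for frame in mask]


def activity_to_segments(activity: List[bool]) -> List[Tuple[int, int]]:
    """Collapse per-frame flags into (start_frame, end_frame_exclusive) pairs."""
    segments: List[Tuple[int, int]] = []
    start = None
    for t, active in enumerate(activity):
        if active and start is None:
            start = t  # a new active run begins here
        elif not active and start is not None:
            segments.append((start, t))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(activity)))  # run reaches the last frame
    return segments


if __name__ == "__main__":
    # Toy mask: 4 frames x 2 frequency bins for a single speaker.
    toy_mask = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2], [0.6, 0.8]]
    flags = masks_to_activity(toy_mask)
    print(flags)
    print(activity_to_segments(flags))
```

Running the two functions once per speaker gives the "which speaker is active and when" information in frame units; converting frames to seconds would need the frame shift, which the record does not state.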