A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition
Deep learning (DL)-based speaker diarization methods have shown strong performance compared with traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenes. However, most DL-based approaches cannot utilize the spatial information well due to the poo...
Main Authors: | Ma, Feng; Tu, Yanhui; He, Maokui; Wang, Ruoyu; Niu, Shutong; Sun, Lei; Ye, Zhongfu; Du, Jun; Pan, Jia; Lee, Chin-Hui |
---|---|
Format: | Conference Proceeding |
Language: | English |
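Subjects: | Acoustics; CHiME-7 Challenge; Estimation; iterative mask estimation; Iterative methods; multi-channel speech enhancement; Robustness; Speaker diarization; Speech enhancement; Speech recognition; Topology |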
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 12335 |
container_issue | |
container_start_page | 12331 |
container_title | |
container_volume | |
creator | Ma, Feng; Tu, Yanhui; He, Maokui; Wang, Ruoyu; Niu, Shutong; Sun, Lei; Ye, Zhongfu; Du, Jun; Pan, Jia; Lee, Chin-Hui |
description | Deep learning (DL)-based speaker diarization methods have shown strong performance compared with traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenes. However, most DL-based approaches cannot exploit spatial information well because they are not robust to unknown array topologies and acoustic scenarios. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve the performance of speaker diarization in various real-world acoustic scenarios. First, the complex angular central Gaussian mixture model (cACGMM), with diarization results as initial values, is used to estimate the presence probability of each speaker at each time-frequency bin, namely the speaker masks, in a long-term chunk. Then, the speaker masks are converted to speaker activities by thresholding, which delivers the diarization information of which speaker is active and when. Finally, the estimated speaker activity can also serve as the initial input for the diarization system, resulting in improved ASR performance. Experimental results on the three CHiME-7 datasets (CHiME-6, DiPCo, Mixer 6) show that the proposed method improves diarization and recognition performance simultaneously. It also plays a key role in the ensemble system that achieves the best performance in the main track of the CHiME-7 DASR Challenge. |
doi_str_mv | 10.1109/ICASSP48485.2024.10446168 |
format | conference_proceeding |
date | 2024-04-14 |
eisbn | 9798350344851 |
oa | free_for_read |
pub | IEEE |
tpages | 5 |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2379-190X |
ispartof | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.12331-12335 |
issn | 2379-190X |
language | eng |
recordid | cdi_ieee_primary_10446168 |
source | IEEE Xplore All Conference Series |
subjects | Acoustics; CHiME-7 Challenge; Estimation; iterative mask estimation; Iterative methods; multi-channel speech enhancement; Robustness; Speaker diarization; Speech enhancement; Speech recognition; Topology |
title | A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T16%3A08%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Spatial%20Long-Term%20Iterative%20Mask%20Estimation%20Approach%20for%20Multi-Channel%20Speaker%20Diarization%20and%20Speech%20Recognition&rft.btitle=ICASSP%202024%20-%202024%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Ma,%20Feng&rft.date=2024-04-14&rft.spage=12331&rft.epage=12335&rft.pages=12331-12335&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP48485.2024.10446168&rft.eisbn=9798350344851&rft_dat=%3Cieee_CHZPO%3E10446168%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i728-5a7a8e483364496b888e0540ce525b11bd34867d349fa63c07199dd7c7d3c1eb3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10446168&rfr_iscdi=true |
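
The thresholding step named in the description (speaker masks converted to speaker activities) can be illustrated with a minimal sketch. It is not the authors' implementation: the frequency-averaging rule, the 0.5 threshold, the 0.5 s frame shift, and the function names `mask_to_activity` and `activity_to_segments` are all assumptions made for illustration, and the cACGMM mask estimation itself is not shown.

```python
from typing import List, Tuple

# One speaker's mask: frames x frequency bins, each value a presence
# probability in [0, 1]. Every frame is assumed to be non-empty.
Mask = List[List[float]]


def mask_to_activity(mask: Mask, threshold: float = 0.5) -> List[bool]:
    """Mark a frame active when its mean mask value exceeds the threshold.

    Averaging over frequency and the 0.5 default are illustrative
    assumptions, not values taken from the record.
    """
    return [sum(frame) / len(frame) > threshold for frame in mask]


def activity_to_segments(
    activity: List[bool], frame_shift: float
) -> List[Tuple[float, float]]:
    """Merge runs of consecutive active frames into (start, end) seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current active run
    for i, active in enumerate(activity):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:  # close a run that reaches the last frame
        segments.append((start * frame_shift, len(activity) * frame_shift))
    return segments


# Toy masks for two hypothetical speakers: 4 frames, 2 frequency bins each.
masks = {
    "spk1": [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2], [0.6, 0.8]],
    "spk2": [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.4, 0.2]],
}
for speaker, mask in masks.items():
    activity = mask_to_activity(mask)
    print(speaker, activity_to_segments(activity, frame_shift=0.5))
```

In an iterative scheme of the kind the description outlines, segments like these would be passed back as the initial input for the next diarization pass; how that hand-off is done is not specified in this record.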