Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture
container_end_page | 11630 |
container_start_page | 11626 |
creator | Yang, Gaobin; He, Maokui; Niu, Shutong; Wang, Ruoyu; Yue, Yanyan; Qian, Shuangqing; Wu, Shilong; Du, Jun; Lee, Chin-Hui |
description | We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvements in both efficiency and performance. Next, we further decrease the memory occupation of decoding by incorporating input feature fusion, and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, a relative improvement of 49% over the official baseline system, and was the key technique behind our best-performing entry in the main track of the CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in the MA-MSE module to retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code is available at https://github.com/liyunlongaaa/NSD-MS2S. |
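The DER figure quoted in the description is the standard diarization metric: the fraction of reference speech time charged as missed speech, false alarm, or speaker confusion. As a minimal frame-level sketch (this is not the paper's scoring code; the function name and the per-frame speaker-set representation are illustrative assumptions, and hypothesis labels are assumed already mapped to reference labels):

```python
def diarization_error_rate(ref, hyp):
    """Frame-level DER = (miss + false alarm + confusion) / total reference speech.

    ref, hyp: aligned lists of per-frame speaker sets (empty set = silence).
    """
    miss = fa = conf = speech = 0
    for r, h in zip(ref, hyp):
        speech += len(r)                           # total scored reference speech
        miss += max(len(r) - len(h), 0)            # reference speakers left unmatched
        fa += max(len(h) - len(r), 0)              # hypothesis speakers with no reference
        conf += min(len(r), len(h)) - len(r & h)   # matched slots with the wrong identity
    return (miss + fa + conf) / speech if speech else 0.0

# Example: one confused frame and one false-alarm frame out of 3 speech frames
ref = [{"A"}, {"A"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"B"}, {"A"}]
der = diarization_error_rate(ref, hyp)  # (0 miss + 1 FA + 1 confusion) / 3
```

Note also that the reported 49% relative improvement implies a baseline macro DER of roughly 15.9 / (1 − 0.49) ≈ 31.2% on the same set.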
doi_str_mv | 10.1109/ICASSP48485.2024.10446661 |
format | conference_proceeding |
identifier | EISSN: 2379-190X |
ispartof | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, p.11626-11630 |
issn | 2379-190X |
language | eng |
source | IEEE Xplore All Conference Series |
subjects | Acoustics; CHiME challenge; Codes; Error analysis; Graphics processing units; Memory management; memory-aware speaker embedding; Oral communication; sequence-to-sequence architecture; Signal processing; speaker diarization |
title | Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture |