Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).
Published in: | arXiv.org 2020-07 |
---|---|
Main Authors: | Medennikov, Ivan; Korenevsky, Maxim; Prisyach, Tatiana; Khokhlov, Yuri; Korenevskaya, Mariya; Sorokin, Ivan; Timofeeva, Tatiana; Mitrofanov, Anton; Andrusenko, Andrei; Podluzhny, Ivan; Laptev, Aleksandr; Romanenko, Aleksei |
Format: | Article |
Language: | English |
Subjects: | Clustering; Post-processing; Target detection; Voice activity detectors; Voice recognition |
container_title | arXiv.org |
---|---|
creator | Medennikov, Ivan; Korenevsky, Maxim; Prisyach, Tatiana; Khokhlov, Yuri; Korenevskaya, Mariya; Sorokin, Ivan; Timofeeva, Tatiana; Mitrofanov, Anton; Andrusenko, Andrei; Podluzhny, Ivan; Laptev, Aleksandr; Romanenko, Aleksei |
description | Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER). |
doi_str_mv | 10.48550/arxiv.2005.07272 |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
publication_date | 2020-07-27 |
rights | 2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2020-07 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2404184487 |
source | Publicly Available Content (ProQuest) |
subjects | Clustering; Post-processing; Target detection; Voice activity detectors; Voice recognition |
title | Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario |
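The core TS-VAD idea from the abstract — concatenate frame-level features with each target speaker's i-vector and emit one binary (sigmoid) activity output per speaker per frame — can be sketched as follows. This is a minimal illustrative model in NumPy, not the authors' implementation: the single hidden layer, the layer sizes, and the names `ts_vad_forward`, `W_h`, `W_out` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ts_vad_forward(feats, ivectors, W_h, W_out):
    """Toy TS-VAD forward pass (illustrative, not the paper's architecture).

    feats:    (T, F)  frame-level speech features (e.g. MFCCs)
    ivectors: (S, D)  one i-vector per target speaker
    Returns:  (T, S)  per-frame activity probability for each speaker
    """
    T, _ = feats.shape
    S, _ = ivectors.shape
    probs = np.zeros((T, S))
    for s in range(S):
        # Condition the shared network on speaker s's i-vector
        iv = np.tile(ivectors[s], (T, 1))          # (T, D)
        x = np.concatenate([feats, iv], axis=1)    # (T, F + D)
        h = np.tanh(x @ W_h)                       # shared hidden layer
        probs[:, s] = sigmoid(h @ W_out)[:, 0]     # binary output for speaker s
    return probs

# Toy dimensions: 50 frames, 40-dim features, 16-dim i-vectors, 4 speakers
T, F, D, S, H = 50, 40, 16, 4, 32
feats = rng.normal(size=(T, F))
ivectors = rng.normal(size=(S, D))
W_h = rng.normal(scale=0.1, size=(F + D, H))
W_out = rng.normal(scale=0.1, size=(H, 1))

probs = ts_vad_forward(feats, ivectors, W_h, W_out)
print(probs.shape)  # (50, 4): one activity probability per frame per speaker
```

Because each output is an independent binary decision rather than a single speaker label per frame, several speakers can be active on the same frame, which is what lets the approach handle overlapping speech; the iterative refinement in the paper would re-estimate the i-vectors from frames this model assigns to each speaker.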
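The multi-microphone extension is described only as "a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model." One plausible reading, sketched here under stated assumptions (the scoring vector `w_att`, scalar per-channel scores, and a softmax over channels are all guesses, not details from the paper), is to score each channel's hidden state per frame and take the softmax-weighted sum across channels:

```python
import numpy as np

def attention_fuse(hidden, w_att):
    """Fuse per-channel hidden representations with scalar channel attention.

    hidden: (C, T, H)  hidden states from C single-channel TS-VAD models
    w_att:  (H,)       attention scoring vector (an assumption for this sketch)
    Returns: (T, H)    channel-weighted representation for each frame
    """
    scores = hidden @ w_att                        # (C, T): one score per channel/frame
    scores -= scores.max(axis=0, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over the channel axis
    # Convex combination of the C channel representations, per frame
    return np.einsum('ct,cth->th', weights, hidden)

rng = np.random.default_rng(1)
C, T, H = 6, 50, 32                 # e.g. a 6-microphone array, 50 frames
hidden = rng.normal(size=(C, T, H))
w_att = rng.normal(size=(H,))
fused = attention_fuse(hidden, w_att)
print(fused.shape)  # (50, 32)
```

The fused representation would then feed the same per-speaker output layers as in the single-channel case, letting the model weight microphones differently frame by frame (e.g. favoring the channel closest to the active speaker at a dinner table).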