
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker on each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) absolute.
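The per-speaker prediction scheme described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: layer sizes, the single shared hidden layer, and all names (`TSVADSketch`, `forward`) are assumptions. The key idea shown is that each frame's acoustic features are paired with each target speaker's i-vector, and a separate binary (sigmoid) output yields that speaker's activity probability per frame.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TSVADSketch:
    """Toy TS-VAD: per frame, concatenate acoustic features with each
    target speaker's i-vector and predict that speaker's activity."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        d = feat_dim + ivec_dim
        self.W1 = rng.standard_normal((d, hidden)) * 0.01  # shared hidden layer
        self.w2 = rng.standard_normal(hidden) * 0.01       # binary output head

    def forward(self, feats, ivectors):
        # feats: (T, feat_dim) frame features; ivectors: (num_speakers, ivec_dim)
        T = feats.shape[0]
        probs = []
        for ivec in ivectors:  # one binary output per target speaker
            x = np.concatenate([feats, np.tile(ivec, (T, 1))], axis=1)
            h = np.tanh(x @ self.W1)
            probs.append(sigmoid(h @ self.w2))  # (T,) activity probabilities
        return np.stack(probs)  # (num_speakers, T)

model = TSVADSketch()
p = model.forward(np.zeros((5, 40)), np.zeros((4, 100)))
print(p.shape)  # (4, 5): 4 speakers x 5 frames
```

In the paper's pipeline, the i-vectors fed to such a model would come from an initial clustering-based diarization and could be re-estimated iteratively from the model's own output.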


Bibliographic Details
Published in: arXiv.org, 2020-07
Main Authors: Medennikov, Ivan, Korenevsky, Maxim, Prisyach, Tatiana, Khokhlov, Yuri, Korenevskaya, Mariya, Sorokin, Ivan, Timofeeva, Tatiana, Mitrofanov, Anton, Andrusenko, Andrei, Podluzhny, Ivan, Laptev, Aleksandr, Romanenko, Aleksei
Format: Article
Language: English
Subjects: Clustering; Post-processing; Target detection; Voice activity detectors; Voice recognition
Online Access: Get full text
description Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.
doi_str_mv 10.48550/arxiv.2005.07272
format article
publisher Ithaca: Cornell University Library, arXiv.org
startdate 2020-07-27
rights 2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-07
issn 2331-8422
language eng
recordid cdi_proquest_journals_2404184487
source Publicly Available Content (ProQuest)
subjects Clustering
Post-processing
Target detection
Voice activity detectors
Voice recognition
title Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario