Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).
Published in: | arXiv.org 2020-07 |
---|---|
Main Authors: | Medennikov, Ivan; Korenevsky, Maxim; Prisyach, Tatiana; Khokhlov, Yuri; Korenevskaya, Mariya; Sorokin, Ivan; Timofeeva, Tatiana; Mitrofanov, Anton; Andrusenko, Andrei; Podluzhny, Ivan; Laptev, Aleksandr; Romanenko, Aleksei |
Format: | Article |
Language: | English |
Subjects: | Clustering; Post-processing; Target detection; Voice activity detectors; Voice recognition |
container_title | arXiv.org |
---|---|
creator | Medennikov, Ivan; Korenevsky, Maxim; Prisyach, Tatiana; Khokhlov, Yuri; Korenevskaya, Mariya; Sorokin, Ivan; Timofeeva, Tatiana; Mitrofanov, Anton; Andrusenko, Andrei; Podluzhny, Ivan; Laptev, Aleksandr; Romanenko, Aleksei |
description | Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER). |
doi_str_mv | 10.48550/arxiv.2005.07272 |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
publication_date | 2020-07-27 |
rights | 2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2020-07 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2404184487 |
source | Publicly Available Content (ProQuest) |
subjects | Clustering; Post-processing; Target detection; Voice activity detectors; Voice recognition |
title | Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario |
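The core TS-VAD idea from the abstract — concatenate frame-level features with each target speaker's i-vector and emit one binary (sigmoid) activity output per speaker per frame — can be sketched as follows. This is a minimal illustrative model in NumPy, not the authors' implementation: the single hidden layer, the layer sizes, and the names `ts_vad_forward`, `W_h`, `W_out` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ts_vad_forward(feats, ivectors, W_h, W_out):
    """Toy TS-VAD forward pass (illustrative, not the paper's architecture).

    feats:    (T, F)  frame-level speech features (e.g. MFCCs)
    ivectors: (S, D)  one i-vector per target speaker
    Returns:  (T, S)  per-frame activity probability for each speaker
    """
    T, _ = feats.shape
    S, _ = ivectors.shape
    probs = np.zeros((T, S))
    for s in range(S):
        # Condition the shared network on speaker s's i-vector
        iv = np.tile(ivectors[s], (T, 1))          # (T, D)
        x = np.concatenate([feats, iv], axis=1)    # (T, F + D)
        h = np.tanh(x @ W_h)                       # shared hidden layer
        probs[:, s] = sigmoid(h @ W_out)[:, 0]     # binary output for speaker s
    return probs

# Toy dimensions: 50 frames, 40-dim features, 16-dim i-vectors, 4 speakers
T, F, D, S, H = 50, 40, 16, 4, 32
feats = rng.normal(size=(T, F))
ivectors = rng.normal(size=(S, D))
W_h = rng.normal(scale=0.1, size=(F + D, H))
W_out = rng.normal(scale=0.1, size=(H, 1))

probs = ts_vad_forward(feats, ivectors, W_h, W_out)
print(probs.shape)  # (50, 4): one activity probability per frame per speaker
```

Because each output is an independent binary decision rather than a single speaker label per frame, several speakers can be active on the same frame, which is what lets the approach handle overlapping speech; the iterative refinement in the paper would re-estimate the i-vectors from frames this model assigns to each speaker.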
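The multi-microphone extension is described only as "a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model." One plausible reading, sketched here under stated assumptions (the scoring vector `w_att`, scalar per-channel scores, and a softmax over channels are all guesses, not details from the paper), is to score each channel's hidden state per frame and take the softmax-weighted sum across channels:

```python
import numpy as np

def attention_fuse(hidden, w_att):
    """Fuse per-channel hidden representations with scalar channel attention.

    hidden: (C, T, H)  hidden states from C single-channel TS-VAD models
    w_att:  (H,)       attention scoring vector (an assumption for this sketch)
    Returns: (T, H)    channel-weighted representation for each frame
    """
    scores = hidden @ w_att                        # (C, T): one score per channel/frame
    scores -= scores.max(axis=0, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over the channel axis
    # Convex combination of the C channel representations, per frame
    return np.einsum('ct,cth->th', weights, hidden)

rng = np.random.default_rng(1)
C, T, H = 6, 50, 32                 # e.g. a 6-microphone array, 50 frames
hidden = rng.normal(size=(C, T, H))
w_att = rng.normal(size=(H,))
fused = attention_fuse(hidden, w_att)
print(fused.shape)  # (50, 32)
```

The fused representation would then feed the same per-speaker output layers as in the single-channel case, letting the model weight microphones differently frame by frame (e.g. favoring the channel closest to the active speaker at a dinner table).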