ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment

Bibliographic Details
Published in:	Neural processing letters 2022-06, Vol.54 (3), p.1565-1586
Main Authors:	Khalid, Hasam; Tariq, Shahroz; Kim, TaeSoo; Ko, Jong Hwan; Woo, Simon S.
Format:	Article
Language:	English
Subjects:	Artificial Intelligence; Audio data; Background noise; Complex Systems; Computational Intelligence; Computer Science; Data collection; Datasets; Deep learning; Machine learning; Neural networks; Robustness; Signal processing; Signal to noise ratio; Statistical methods; Support vector machines; Synthesis; Voice activity detectors; Voice recognition

Description
Summary:	Detecting human speech is foundational for a wide range of emerging intelligent applications. However, accurately detecting human speech is challenging, especially in the presence of unknown noise patterns. Generally, deep learning-based methods have been shown to be more robust and accurate than statistical methods and other existing approaches. However, creating a noise-robust and more generalized deep learning-based voice activity detection system typically requires collecting an enormous amount of annotated audio data. In this work, we develop a generalized model trained on limited types of human speech with noisy backgrounds, yet it can detect human speech in the presence of various unseen noise types that were not present in the training set. To achieve this, we propose a one-class residual connections-based variational autoencoder (ORVAE), which requires only a limited amount of human speech data with noisy backgrounds for training, thereby eliminating the need to collect data with diverse noise patterns. Evaluated on three different datasets (synthesized TIMIT and NOISEX-92, synthesized LibriSpeech and NOISEX-92, and a Publicly Recorded dataset), ORVAE outperforms other one-class baseline methods, achieving F1-scores of over 90% at multiple signal-to-noise ratio levels.
ISSN:	1370-4621 1573-773X
DOI:	10.1007/s11063-021-10695-4