LOTUS-SOC: A social media speech corpus for Thai LVCSR in noisy environments

Bibliographic Details
Main Authors: Chootrakool, Patcharika; Chunwijitra, Vataya; Sertsi, Phuttapong; Kasuriya, Sawit; Wutiwiwatchai, Chai
Format: Conference Proceeding
Language: English
Description
Summary: A Large vOcabulary Thai continUous Speech - SOCial media corpus (LOTUS-SOC) has been under development since 2015. Twitter messages were selected as the source text for sound recording through a mobile application. At present, 172 hours of speech from 208 speakers have been recorded, and recording of a further 192 speakers is under way to reach the planned total of 400 speakers. The data are designed to balance gender and 8 types of noise conditions. This paper describes the corpus design and development process in detail. The corpus is intended for building a Thai large vocabulary continuous speech recognizer (LVCSR) that deals better with spoken-style input speech in various noisy environments. To assess the corpus, different kinds of Thai LVCSR systems have been built. Evaluations show that systems additionally trained on LOTUS-SOC are more robust to noisy environments. With the best setting and training method, the GMM-based and DNN-based systems achieve word error rates of 35.2% and 17.1%, respectively.
ISSN: 2472-7695
DOI: 10.1109/ICSDA.2016.7919017