The development and analysis of a Malay broadcast news corpus

Bibliographic Details
Main Authors: Tze Yuang Chong, Xiong Xiao, Haihua Xu, Tien-Ping Tan, Pham Chau-Khoa, Dau-Cheng Lyu, Eng Siong Chng, Haizhou Li
Format: Conference Proceeding
Language: English
Description
Summary: This paper presents our effort in collecting a Malay broadcast news (BN) speech corpus to support our research in Malay large-vocabulary continuous speech recognition (LVCSR). The 53-hour corpus was recorded from TV channels in both Singapore and Malaysia over a 9-month period. To facilitate various kinds of LVCSR research, the corpus provides, besides the orthographic transcription, other metadata such as speaking environment type, speaker identity information, language identity, and topic descriptions. In the orthographic transcription, we also tagged various linguistic phenomena such as disfluencies, code-switched words, and proper nouns. We trained an ASR system and achieved a word error rate of 8.5% for anchor speech and 17.1% overall (including reporters' and other speakers' speech) on 27 hours of test data.
DOI: 10.1109/ICSDA.2013.6709862