
The development and analysis of a Malay broadcast news corpus

This paper presents our effort in collecting a Malay broadcast news (BN) speech corpus to support our research in Malay LVCSR. The 53-hour corpus was recorded from TV channels in both Singapore and Malaysia over a 9-month period. To facilitate various lines of LVCSR research, besides orthographic...

Bibliographic Details
Main Authors: Tze Yuang Chong, Xiong Xiao, Haihua Xu, Tien-Ping Tan, Pham Chau-Khoa, Dau-Cheng Lyu, Eng Siong Chng, Haizhou Li
Format: Conference Proceeding
Language: English
container_end_page 5
container_start_page 1
creator Tze Yuang Chong
Xiong Xiao
Haihua Xu
Tien-Ping Tan
Pham Chau-Khoa
Dau-Cheng Lyu
Eng Siong Chng
Haizhou Li
description This paper presents our effort in collecting a Malay broadcast news (BN) speech corpus to support our research in Malay LVCSR. The 53-hour corpus was recorded from TV channels in both Singapore and Malaysia over a 9-month period. To facilitate various lines of LVCSR research, the corpus provides, besides orthographic transcription, other metadata such as speaking environment type, speaker identity information, language identity, and topic descriptions. In the orthographic transcription, we also tagged various linguistic phenomena such as disfluencies, code-switched words, and proper nouns. We trained an ASR system and achieved a word error rate of 8.5% for anchor speech and 17.1% overall (including reporter and other speakers' speech) on 27 hours of test data.
doi_str_mv 10.1109/ICSDA.2013.6709862
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISBN: 9781479923786
ispartof 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, p.1-5
language eng
recordid cdi_ieee_primary_6709862
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Acoustics
broadcast news
Interviews
Malay
Noise
Speech
Speech corpus
Speech recognition
Switches
title The development and analysis of a Malay broadcast news corpus
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T16%3A13%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=The%20development%20and%20analysis%20of%20a%20Malay%20broadcasr%20news%20corpus&rft.btitle=2013%20International%20Conference%20Oriental%20COCOSDA%20held%20jointly%20with%202013%20Conference%20on%20Asian%20Spoken%20Language%20Research%20and%20Evaluation%20(O-COCOSDA/CASLRE)&rft.au=Tze%20Yuang%20Chong&rft.date=2013-11&rft.spage=1&rft.epage=5&rft.pages=1-5&rft_id=info:doi/10.1109/ICSDA.2013.6709862&rft.eisbn=9781479923786&rft.eisbn_list=1479923788&rft_dat=%3Cieee_6IE%3E6709862%3C/ieee_6IE%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i90t-15a7d16eefab404ca24806e4ea74e07c0f992faa140a7a4feb8537b6137bb0483%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6709862&rfr_iscdi=true