Loading…

Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus

The collection of Chinese-Myanmar Parallel Corpus (CMPC) is the key step in the natural language processing (NLP) and training Machine Translation Engine (MTE) of Southeast Asia minority languages. As the scarcity of CMPC resources that efficient corpus collection methods are worth studying extremel...

Full description

Saved in:
Bibliographic Details
Published in:IOP conference series. Materials Science and Engineering 2019-10, Vol.646 (1), p.12046
Main Authors: Xiong, Kai, Yuan, Rui, He, Wenxue, Jing, Yanmei, Wang, Yansheng, He, Qiqi, Li, Huafu
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The collection of Chinese-Myanmar Parallel Corpus (CMPC) is the key step in the natural language processing (NLP) and training Machine Translation Engine (MTE) of Southeast Asia minority languages. As the scarcity of CMPC resources that efficient corpus collection methods are worth studying extremely. Traditional corpus collection methods include manual collection, text recognition of books and Internet crawlers, etc. Among them, the most efficient method to collect corpus is internet crawler preached by many. Traditional Internet crawler algorithm is interfere easily by a lot of spamming and advertising that lead to the time-consuming and low-precision. We propose a web crawler mechanism combines acquisition automatically technology bilingual website list, crawling corpus and cleaning corpus to obtain high quality parallel corpus. Firstly, using the hyperlinks to recursively access related corpus websites through building the website graph. Furthermore, the breadth-first, Backline and PageRank crawler framework used to build a corpus selection model based on crawling with threshold, matching link, ranking the heat of page, through this, the CMPC can be found accurately. Finally, the corpus cleaning model based on the HTML parsing to determine a set of standardized token sequences. By testing the Chinese-Myanmar reptile algorithm established in this paper, the experimental results show that our benchmarks this model exceeds previous published benchmarks. Up to now, we have obtained 1.1 million parallel corpus pairs of Chinese-Myanmar.
ISSN:1757-8981
1757-899X
DOI:10.1088/1757-899X/646/1/012046