Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus
Published in: | IOP Conference Series: Materials Science and Engineering, 2019-10, Vol. 646 (1), p. 012046 |
---|---|
Main Authors: | , , , , , , |
Format: | Article |
Language: | English |
Summary: | Collecting a Chinese-Myanmar Parallel Corpus (CMPC) is a key step in natural language processing (NLP) and in training machine translation engines (MTE) for the minority languages of Southeast Asia. Because CMPC resources are scarce, efficient corpus collection methods are well worth studying. Traditional collection methods include manual collection, text recognition of books, and Internet crawlers; of these, the crawler is widely regarded as the most efficient. However, traditional crawler algorithms are easily disturbed by spam and advertising, which makes them time-consuming and imprecise. We propose a web crawler mechanism that combines automatic acquisition of a bilingual website list, corpus crawling, and corpus cleaning to obtain a high-quality parallel corpus. First, a website graph is built and its hyperlinks are followed recursively to reach related corpus websites. Next, a breadth-first, backlink, and PageRank crawler framework is used to build a corpus selection model based on threshold-limited crawling, link matching, and page popularity ranking, so that CMPC pages can be located accurately. Finally, a corpus cleaning model based on HTML parsing produces a set of standardized token sequences. In tests of the Chinese-Myanmar crawler algorithm established in this paper, the experimental results show that the model exceeds previously published benchmarks. To date, we have obtained 1.1 million Chinese-Myanmar parallel corpus pairs. |
---|---|
ISSN: | 1757-8981 1757-899X |
DOI: | 10.1088/1757-899X/646/1/012046 |
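
The summary names three building blocks: breadth-first traversal of a website graph with a threshold, PageRank-style ranking of pages, and cleaning based on HTML parsing. The sketch below illustrates each one on a toy in-memory graph using only the Python standard library. It is not the authors' implementation: the record gives no thresholds, parameters, or details of the backlink step, so the names `LINKS`, `bfs_crawl`, `pagerank`, and `clean_html`, the depth limit, and the damping factor of 0.85 are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical toy website graph: page -> pages it links to.
LINKS = {
    "hub": ["zh", "my", "ad"],
    "zh": ["my", "hub"],
    "my": ["zh", "hub"],
    "ad": ["hub"],
}


def bfs_crawl(graph, seed, max_depth):
    """Return pages reachable from `seed` within `max_depth` hops, in BFS order."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:  # depth threshold (assumed): do not expand further
            continue
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; the damping value is an assumed default."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in nodes}
        for page in nodes:
            # A page with no outgoing links spreads its rank over all pages.
            targets = graph.get(page) or nodes
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def clean_html(html):
    """Return the whitespace-normalized text segments of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

A real crawler would fetch pages over the network and derive the graph from the hyperlinks it finds. The ranking step could then order the crawl frontier so that heavily linked pages are visited before advertising or spam pages.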