Loading…

Orthographic and morphological processing for English-Arabic statistical machine translation

Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes o...

Full description

Saved in:
Bibliographic Details
Published in:Machine translation 2012-03, Vol.26 (1/2), p.25-45
Main Authors: Kholy, Ahmed El, Habash, Nizar
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.
ISSN:0922-6567
1573-0573
DOI:10.1007/s10590-011-9110-0