Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Bibliographic Details
Published in: Computational Linguistics (Association for Computational Linguistics), 2006-09, Vol. 32 (3), pp. 295-340
Main Authors: Ringlstetter, Christoph; Schulz, Klaus U.; Mihov, Stoyan
Format: Article
Language: English
Description
Summary: Since the Web is by far the largest public repository of natural language texts, recent experiments, methods, and tools in corpus linguistics often use the Web as a corpus. Applications where high accuracy is crucial must contend with the non-negligible number of orthographic and grammatical errors that occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
ISSN: 0891-2017, 1530-9312
DOI: 10.1162/coli.2006.32.3.295