Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction

Bibliographic Details
Published in:	Computer Speech & Language, 2025-04, Vol.91, p.101750, Article 101750
Main Authors:	Lin, Nankai; Zhang, Hongbin; Shen, Menglan; Wang, Yu; Jiang, Shengyi; Yang, Aimin
Format:	Article
Language:	English
Subjects:	Grammatical error correction; Low-resource languages; Sentence perplexity scoring

Description
Summary:	Grammatical error correction (GEC) is a challenging task for natural language processing techniques. Many efforts to address GEC have been made for high-resource languages such as English or Chinese. However, limited work has been done for low-resource languages because of the lack of large annotated corpora. For low-resource languages, current unsupervised GEC methods based on language model scoring perform well, but pre-trained language models remain underexplored in this context. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, where GEC is viewed as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detecting and correcting module. We propose a novel scoring method for pseudo-perplexity to evaluate a sentence’s probable correctness and construct a Tagalog corpus for Tagalog GEC research. The framework obtains competitive performance on the self-constructed Tagalog corpus and the open-source Indonesian corpus, and the results demonstrate that it is complementary to the baseline methods for low-resource GEC tasks. Our corpus can be obtained from https://github.com/GKLMIP/TagalogGEC.
• We construct the first Tagalog GEC evaluation corpus.
• Our unsupervised GEC framework is independent of any data annotations.
• Our proposed pseudo-perplexity scoring method evaluates a sentence’s likely validity.
• Experimental results on two corpora verify the effectiveness of the proposed model.
ISSN:	0885-2308
DOI:	10.1016/j.csl.2024.101750
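
Illustration:	The summary above describes scoring a sentence by pseudo-perplexity and choosing a word-level correction from a set of candidates. The record does not give the paper's scoring formula, so the sketch below uses the commonly cited definition of pseudo-perplexity (the exponential of the negative mean log-probability of each token when it alone is masked) and should not be read as the authors' method. The names `pseudo_perplexity`, `correct_word`, and `toy_masked_prob` are hypothetical, the probability table is invented for illustration, and an English toy sentence stands in for Tagalog or Indonesian data. In practice the toy scorer would be replaced by a BERT-style masked language model.

```python
import math
from typing import Callable, List, Sequence

# Assumed interface: given a token sequence and a position, return the
# probability a masked language model assigns to the token at that position
# when it is masked and the rest of the sentence is visible.
MaskedProb = Callable[[Sequence[str], int], float]


def pseudo_perplexity(tokens: Sequence[str], masked_prob: MaskedProb) -> float:
    """exp(-(1/N) * sum_i log P(token_i | all other tokens))."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    log_sum = sum(math.log(masked_prob(tokens, i)) for i in range(len(tokens)))
    return math.exp(-log_sum / len(tokens))


def correct_word(
    tokens: Sequence[str],
    position: int,
    candidates: Sequence[str],
    masked_prob: MaskedProb,
) -> str:
    """Return the candidate whose substitution gives the lowest pseudo-perplexity."""
    def score(candidate: str) -> float:
        trial: List[str] = list(tokens)
        trial[position] = candidate
        return pseudo_perplexity(trial, masked_prob)

    return min(candidates, key=score)


# Invented probabilities keyed on (left neighbour, token); NOT measured from
# any real model. Unknown pairs fall back to a small default.
_TOY_TABLE = {
    ("<s>", "she"): 0.4,
    ("she", "goes"): 0.6,
    ("she", "go"): 0.05,
    ("goes", "home"): 0.5,
    ("go", "home"): 0.5,
}


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    """Stand-in for a masked language model; looks only at the left neighbour."""
    left = tokens[i - 1] if i > 0 else "<s>"
    return _TOY_TABLE.get((left, tokens[i]), 0.05)


if __name__ == "__main__":
    sentence = ["she", "go", "home"]
    fixed = correct_word(sentence, 1, ["go", "goes", "going"], toy_masked_prob)
    print(fixed, pseudo_perplexity(sentence, toy_masked_prob))
```

Selecting among candidates for one position is what makes the correction step resemble a multi-class classification: each candidate is a class, and the lowest pseudo-perplexity picks the winner. How the candidate set is built and which positions are examined are left open here, since the record does not describe them.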