Low-resource neural character-based noisy text normalization

Bibliographic Details
Published in: Journal of Intelligent & Fuzzy Systems, 2019-01, Vol. 36 (5), p. 4921-4929
Main Authors: Mager, Manuel, Rosales, Mónica Jasso, Çetinoğlu, Özlem, Meza, Ivan
Format: Article
Language:English
Description
Summary: User-generated text in social networks is often not written in its standard form. Such text introduces large lexical variation and inconsistency into datasets, so normalizing it is a crucial preprocessing step for common Natural Language Processing tools. In this paper we explore the state of the art of the machine-translation approach to text normalization under low-resource conditions. We also propose an auxiliary task for the sequence-to-sequence (seq2seq) neural architecture, novel to the text normalization task, that improves the base seq2seq model by up to 5%. This performance gain closes the gap between statistical machine translation approaches and neural ones for low-resource text normalization.
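
The task the abstract describes, mapping noisy user-generated tokens to their standard forms, can be illustrated with a toy lookup baseline. Note this is only a hypothetical sketch of the input/output mapping; the lexicon entries below are invented for illustration, and the paper's actual method is a character-level seq2seq neural model, not a dictionary.

```python
# Toy illustration of noisy-text normalization as a token mapping.
# The lexicon is hypothetical; the paper uses a character-level
# seq2seq model rather than a fixed dictionary.
NORMALIZATION_LEXICON = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "thx": "thanks",
}

def normalize(text: str) -> str:
    """Replace known noisy tokens with their standard forms,
    leaving unknown tokens unchanged."""
    return " ".join(NORMALIZATION_LEXICON.get(tok, tok) for tok in text.split())

print(normalize("u r gr8"))  # you are great
```

A dictionary baseline like this fails on unseen spellings, which is why the paper treats normalization as character-level translation: a seq2seq model can generalize character-level patterns (e.g. digit-for-syllable substitutions) to tokens never observed in training.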
ISSN: 1064-1246, 1875-8967
DOI: 10.3233/JIFS-179039