Low-resource neural character-based noisy text normalization
Published in: Journal of Intelligent & Fuzzy Systems, 2019-01, Vol. 36 (5), pp. 4921-4929
Main Authors: , , ,
Format: Article
Language: English
Summary: User-generated text in social networks is often not written in its standard form. Such text introduces large dispersion and inconsistency into datasets, making normalization a crucial preprocessing step for common Natural Language Processing tools. In this paper we explore the state of the art of the machine-translation approach to text normalization under low-resource conditions. We also propose an auxiliary task, novel to text normalization, for the sequence-to-sequence (seq2seq) neural architecture, which improves the base seq2seq model by up to 5%. This performance gain closes the gap between statistical machine translation approaches and neural ones for low-resource text normalization.
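The abstract frames normalization as character-level machine translation with an added auxiliary task. A minimal sketch of that architecture shape, in PyTorch, is shown below. The specific auxiliary task is an assumption: the abstract does not say what the paper's auxiliary objective is, so this sketch uses a plausible stand-in (tagging each source character as needing normalization or not), and all names (`CharSeq2Seq`, `aux`, dimensions) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """Character-level encoder-decoder with an auxiliary head.

    Hypothetical sketch: the paper's actual auxiliary task is not given
    in the abstract; here the auxiliary head tags each *source*
    character as clean/noisy, sharing the encoder with the main
    normalization decoder."""

    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)  # main task: next clean char
        self.aux = nn.Linear(hidden, 2)           # aux task: per-source-char binary tag

    def forward(self, src, tgt):
        # Encode the noisy character sequence; final state seeds the decoder.
        enc_out, h = self.encoder(self.embed(src))
        # Teacher-forced decoding of the clean character sequence.
        dec_out, _ = self.decoder(self.embed(tgt), h)
        return self.out(dec_out), self.aux(enc_out)

model = CharSeq2Seq(vocab_size=40)
src = torch.randint(0, 40, (2, 10))  # batch of noisy char ids, length 10
tgt = torch.randint(0, 40, (2, 12))  # batch of clean char ids, length 12
main_logits, aux_logits = model(src, tgt)
print(main_logits.shape, aux_logits.shape)  # (2, 12, 40) and (2, 10, 2)
```

In training, the two cross-entropy losses would be summed (possibly weighted); the idea behind such auxiliary objectives is that the extra supervision regularizes the shared encoder, which matters most when parallel noisy/clean data is scarce.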
ISSN: 1064-1246, 1875-8967
DOI: 10.3233/JIFS-179039