Unsupervised Character Embedding Correction and Candidate Word Denoising

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, Vol. 30, p. 76-86
Main Authors: Zheng, Kengtao; Lin, Nankai; Jiang, Shengyi
Format: Article
Language: English
Description
Summary: In this paper, we take Indonesian as the research object and propose a multiple filter correction framework (MFCF). The main idea of MFCF is to remove noise from the candidate words so that the correct word is more likely to be selected. In MFCF, we use a window search algorithm (WSA) to filter the candidate words in the dictionary. When searching for candidate words at Levenshtein distance 1, WSA reduces the candidate word search space by an average of 71%; at Levenshtein distance 2, the search space is reduced by an average of 55%. This reduction in search space increases search speed: for candidate words at Levenshtein distance 1 and 2, WSA is faster than the current advanced search algorithm. This paper also introduces a character vector-based candidate word scoring model (CWSM-CV), a simple, unsupervised method. In MFCF, we use CWSM-CV to filter the candidate word list for the correct word. By exploring the feasibility of scoring candidate words with a word vector-based candidate word scoring model (CWSM-WV), we find that the candidate word list needs to be denoised, and we verify this with experiments. To apply this finding to text correction, we propose a new set of evaluation indicators to replace accuracy. Finally, we recommend that researchers who correct text in low-resource languages make their model an open system and publish it for users. The system receives user feedback as new data, gradually reducing the negative impact of limited data volume.
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2021.3129334
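
Note: the summary describes filtering dictionary words by Levenshtein distance and shrinking the search space before comparison. The record does not describe how WSA works internally, so the sketch below is not WSA. It is a generic illustration, under stated assumptions, of the same general idea: a cheap length-based pre-filter (edit distance is never smaller than the difference in word lengths) reduces how many dictionary words need a full distance computation. The function names and the tiny Indonesian word list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_candidates(word: str, dictionary: list[str], max_dist: int):
    """Return (candidates within max_dist, number of words actually compared).

    Assumption: a simple length bound stands in for the paper's WSA filter.
    """
    # Words whose length differs by more than max_dist cannot be within max_dist.
    pool = [w for w in dictionary if abs(len(w) - len(word)) <= max_dist]
    matches = [w for w in pool if levenshtein(word, w) <= max_dist]
    return matches, len(pool)


# Hypothetical toy dictionary of Indonesian words.
DICTIONARY = ["makan", "makin", "makanan", "minum", "main", "malam", "mana"]

if __name__ == "__main__":
    matches, compared = find_candidates("makn", DICTIONARY, max_dist=1)
    print(matches, f"compared {compared} of {len(DICTIONARY)} words")
```

Here the misspelling "makn" is compared against six of the seven dictionary words, because "makanan" is skipped on length alone. A real system would pair such a filter with a scoring step to pick one word from the surviving candidates.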