Contextual word disambiguates of Ge'ez language with homophonic using machine learning
Published in: | Ampersand (Oxford, UK), 2024-06, Vol.12, p.100169, Article 100169 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | According to natural language processing experts, languages contain numerous ambiguous words. Without automated word meaning disambiguation for a language, developing natural language processing technologies such as information extraction, information retrieval, and machine translation remains a challenging task. Therefore, this paper presents the development of a word sense disambiguation model for duplicate alphabet words in the Ge'ez language using corpus-based methods. Because there is no WordNet or public dataset for the Ge'ez language, 1010 samples of ambiguous words were gathered. The words were then preprocessed and the text was vectorized using bag of words, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings such as word2vec and fastText. The vectorized texts were then analysed using supervised machine learning algorithms such as Naive Bayes, decision trees, random forests, K-nearest neighbor, linear support vector machine, and logistic regression. Bag of words paired with random forests outperformed all other combinations, with an accuracy of 99.52%. However, when deep learning algorithms such as a deep neural network and Long Short-Term Memory were applied to the same dataset, an accuracy of 100% was achieved.
• This paper presents a word sense disambiguation model for duplicate alphabet words in the Ge'ez language using corpus-based methods.
• 1010 samples of ambiguous words were acquired because there is no WordNet or other publicly available dataset for the Ge'ez language.
• When training with NB, Multinomial NB is used for bag of words or TF-IDF (scores of 93.91% for BoW and 90.38% for TF-IDF).
• The Gaussian NB (simply NB) approach was found to perform worst when word embedding techniques are used for text vectorization.
• Using the bag-of-words feature extraction approach and the random forest algorithm, we achieved an accuracy score of 99.52%. |
---|---|
ISSN: | 2215-0390 |
DOI: | 10.1016/j.amper.2024.100169 |
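
The summary above mentions pairing bag-of-words vectors with Multinomial Naive Bayes as one of the evaluated combinations. The following is a minimal, standard-library-only sketch of that general technique, not the authors' implementation. The class name `MultinomialNB`, the helper `bag_of_words`, the whitespace tokenizer, and the smoothing value are all assumptions made for illustration. The training sentences are English placeholders for an ambiguous word ("bank"), used only because the Ge'ez dataset described in the record is not available here.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lower-case, split on whitespace, and count tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


class MultinomialNB:
    """Illustrative Multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # training contexts per sense
        self.token_counts = defaultdict(Counter)   # token counts per sense
        self.vocab = set()
        for text, label in zip(texts, labels):
            counts = bag_of_words(text)
            self.token_counts[label].update(counts)
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def _log_score(self, counts, label):
        # log P(sense) + sum over tokens of count * log P(token | sense)
        total = sum(self.token_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        score = math.log(self.class_docs[label] / self.total_docs)
        for token, n in counts.items():
            if token in self.vocab:  # tokens never seen in training are skipped
                seen = self.token_counts[label][token]
                score += n * math.log((seen + self.alpha) / denom)
        return score

    def predict(self, text):
        counts = bag_of_words(text)
        return max(self.class_docs, key=lambda label: self._log_score(counts, label))


# Placeholder contexts for one ambiguous word; NOT data from the paper.
train_texts = [
    "deposit money at the bank",
    "the bank approved the loan",
    "bank account interest rate",
    "fishing on the river bank",
    "the bank of the river flooded",
    "grass grew along the river bank",
]
train_labels = ["finance", "finance", "finance", "river", "river", "river"]

model = MultinomialNB().fit(train_texts, train_labels)
prediction_money = model.predict("she opened a bank account for the loan")
prediction_river = model.predict("they walked along the river bank")
print(prediction_money, prediction_river)
```

Each sense label plays the role of a class, and the words surrounding the ambiguous word supply the evidence. A real experiment would replace the placeholder sentences with labeled Ge'ez contexts and would normally evaluate on held-out data.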