Arabic Idioms Detection by Utilizing Deep Learning and Transformer-based Models

Bibliographic Details
Published in:	Procedia computer science 2024, Vol.244, p.37-48
Main Author:	Himdi, Hanen
Format:	Article
Language:	English
Subjects:	Deep Learning (DL); Natural Language Processing (NLP); Text Mining; Transformer-based Models

Description
Summary:	Arabic language resources and natural language processing technologies have advanced significantly in recent years. The detection of idiomatic expressions is a crucial problem in Arabic natural language processing, yet progress has been limited by the scarcity of datasets and by computational models that still struggle to handle Arabic idiomatic phrases. Idioms are significant in Arabic because of their distinctive relation to history and cultural knowledge, and they also offer a linguistic tool for conveying concepts and thoughts that may diverge from their literal interpretation. Hence, to accurately detect this incongruent language, a model must be able to interpret an idiom's contextual meaning. To tackle this issue, this study undertakes comprehensive work on detecting Arabic idiomatic and literal statements. For that, we introduce a large dataset of Arabic idiomatic and literal statements. The dataset includes idioms from several Arabic idiom platforms, balanced with literal statements taken from news-platform titles that share keywords with the idioms. This study adopts four deep learning models, namely CNN, LSTM, Bi-LSTM, and GRU, trained with FastText and Skip-Gram Word2Vec word embeddings, and three transformer-based models, BERT, RoBERTa, and DistilBERT. All models were assessed via empirical evaluations. Our experiments demonstrated that transformer-based models outperformed the deep learning models trained with word embeddings at detecting Arabic idioms and literal statements, with DistilBERT performing best at 97%. We also enhanced the accuracy through the ensemble stacking method, boosting it by 0.8 points. Furthermore, we propose a novel Arabic idiom interpreter that offers simple explanations for selected idiomatic content within literal text. Our work contributes to the challenge of detecting incongruities found in similar figurative genres, specifically in Arabic, a morphologically rich language.
ISSN:	1877-0509
DOI:	10.1016/j.procs.2024.10.176
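
Illustrative note: the summary mentions ensemble stacking but this record does not describe how it was configured. The following is a minimal sketch of the general technique only, assuming a logistic-regression meta-learner trained on held-out probabilities from three base classifiers. The function names, the meta-learner choice, and every probability value are hypothetical placeholders, not data or code from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(base_probs, labels, lr=0.5, epochs=5000):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    base_probs: one row per example, one column per base model, each value
                being that model's predicted probability of "idiom".
    labels:     1 for idiom, 0 for literal.
    """
    n_models = len(base_probs[0])
    n = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for probs, y in zip(base_probs, labels):
            z = sum(w * p for w, p in zip(weights, probs)) + bias
            err = sigmoid(z) - y  # gradient of log-loss w.r.t. z
            for j, p in enumerate(probs):
                grad_w[j] += err * p
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, probs):
    """Return 1 (idiom) or 0 (literal) for one row of base-model probabilities."""
    z = sum(w * p for w, p in zip(weights, probs)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical held-out predictions from three base models (placeholder values).
held_out_probs = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.6, 0.7, 0.9],
    [0.4, 0.3, 0.2],
    [0.8, 0.4, 0.9],
    [0.3, 0.6, 0.1],
]
held_out_labels = [1, 0, 1, 0, 1, 0]

weights, bias = fit_meta_learner(held_out_probs, held_out_labels)
stacked = [predict(weights, bias, row) for row in held_out_probs]
```

In this sketch the meta-learner sees only the base models' outputs, so it can learn how much to trust each model; in practice those outputs should come from data the base models were not trained on.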