Mono-lingual text reuse detection for the Urdu language at lexical level
Published in: | Engineering applications of artificial intelligence 2024-10, Vol.136, p.109003, Article 109003 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Text reuse is the process of creating new texts from pre-existing ones. In recent years, Urdu Text Reuse Detection (U-TRD) has attracted researchers' attention because digital text is readily available online and can be copied or paraphrased without proper attribution, which makes reuse easy but detection difficult. Previous studies have explored U-TRD at the phrasal, sentence/passage, and document levels, using benchmark corpora and methods. However, U-TRD has not been investigated at the lexical level, in terms of either corpora or methods. To address this research gap, our study developed a large benchmark corpus, UTRD-Lex-23, manually annotated at the lexical level. The corpus consists of 22,184 text pairs in two categories of rewrite: (1) Derived (8,660) and (2) Non-Derived (13,524). We also developed, applied, evaluated, and compared a range of methods on this corpus: baseline methods (uni-gram overlap and word embedding-based methods), state-of-the-art transformer-based methods, and feature-fusion-based methods. Our study concludes that one of our proposed feature-fusion methods outperforms all other methods. This model combines seven different Sentence Transformers (ST) (each producing 768-dimensional vectors) with one word-level uni-gram feature and sixteen different features extracted from four different Word Embedding (WE) based models (yielding 300-dimensional vectors), and achieves an F1 score of 0.70601 under 10-fold cross-validation. To foster research in Urdu, a low-resource language, the proposed corpus will be made freely and publicly available for research purposes. |
---|---|
ISSN: | 0952-1976 |
DOI: | 10.1016/j.engappai.2024.109003 |
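
The uni-gram overlap baseline named in the summary can be illustrated with a minimal sketch. The record does not give the paper's exact formula, tokenizer, or decision rule, so everything here is an assumption: whitespace tokenization, Jaccard similarity over word sets, and the hypothetical helper names `unigram_overlap` and `label_pair` with an arbitrary threshold of 0.5.

```python
def unigram_overlap(source: str, candidate: str) -> float:
    """Jaccard similarity between the word sets of two texts.

    Assumption: tokens are whitespace-separated words. The paper's
    actual tokenization and overlap measure may differ.
    """
    source_words = set(source.split())
    candidate_words = set(candidate.split())
    union = source_words | candidate_words
    if not union:
        # Two empty texts share no evidence of reuse.
        return 0.0
    return len(source_words & candidate_words) / len(union)


def label_pair(source: str, candidate: str, threshold: float = 0.5) -> str:
    """Map an overlap score to the two corpus categories.

    Assumption: a fixed threshold (0.5 is arbitrary, not from the paper).
    """
    score = unigram_overlap(source, candidate)
    return "Derived" if score >= threshold else "Non-Derived"
```

For example, the Urdu sentences "یہ ایک کتاب ہے" and "یہ ایک قلم ہے" share three of five distinct words, so `unigram_overlap` gives 0.6 and `label_pair` returns "Derived" at the assumed threshold. A score like this could also serve as the single uni-gram feature in a fused feature vector, though the record does not say how the authors built theirs.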