Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach
Published in: | Expert Systems with Applications 2023-12, Vol.234, p.121063, Article 121063 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Text reuse is the process of creating new text(s) from pre-existing text(s). In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital form across the internet and can be copied or paraphrased from another source without proper attribution, which makes it easy to reuse but hard to detect. Text Reuse Detection (TRD) has many potential applications, including plagiarism detection, paraphrase detection, paraphrase generation, and analysis of text reuse in web content. In previous studies, UTRD has been explored at the sentence level (Hafeez, 2022; Hafeez et al., 2023), sentence/passage level (Sameen et al., 2017), and document level (Sharjeel et al., 2017), along with benchmark corpora and approaches. However, UTRD has not been explored at the phrasal level with respect to either corpora or approaches. To fill this research gap, this study contributes a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: (1) Derived = 15,105 and (2) Non-Derived = 9,896. In addition, we developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed Sentence Transformer-based approaches on the proposed UTRD-Phr-23 Corpus.
As another contribution, we propose a novel Sentence Transformers-based model that combines eight different Sentence Transformers (ST): paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens. Our proposed model achieves an F1 score of 0.63, outperforming the best result obtained with the N-gram Overlap (baseline) approach (F1 = 0.53).
• Proposed a gold-standard benchmark corpus for UTRD at the phrasal level. • Developed classical N-gram and Word Embedding approaches as baselines. • Developed and proposed Transfer Learning approaches. |
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2023.121063 |
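
The summary names N-gram Overlap as a baseline but does not say which overlap measure, tokenizer, or decision threshold the authors used. As a rough illustration only, the sketch below assumes whitespace tokenization, Jaccard similarity over word n-grams, and a threshold of 0.5. The function names `word_ngrams`, `jaccard_overlap`, and `classify_pair` are hypothetical and do not come from the article.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_overlap(source, candidate, n=1):
    """Jaccard similarity between the word n-gram sets of two phrases."""
    a = word_ngrams(source, n)
    b = word_ngrams(candidate, n)
    if not a or not b:
        # A phrase shorter than n words has no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


def classify_pair(source, candidate, n=1, threshold=0.5):
    """Label a pair Derived or Non-Derived; the threshold value is an assumption."""
    score = jaccard_overlap(source, candidate, n)
    return "Derived" if score >= threshold else "Non-Derived"
```

In practice the threshold and the n-gram length would be tuned on held-out annotated pairs, and a language-aware tokenizer would likely replace the whitespace split for Urdu text.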