Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach

Bibliographic Details
Published in: Expert Systems with Applications, 2023-12, Vol. 234, Article 121063
Main Authors: Mehak, Gull; Muneer, Iqra; Nawab, Rao Muhammad Adeel
Format: Article
Language: English
Description
Summary: Text reuse is the process of creating new text(s) from pre-existing text(s). Text Reuse Detection (TRD) has many potential applications, including plagiarism detection, paraphrase detection, paraphrase generation, and the analysis of text reuse in web content. In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital format all over the internet and can be copied or paraphrased from another source without proper attribution, which makes it easy to reuse but hard to detect. In previous studies, the problem of UTRD has been explored at the sentence level (Hafeez, 2022; Hafeez et al., 2023), sentence/passage level (Sameen et al., 2017), and document level (Sharjeel et al., 2017), along with benchmark corpora and approaches. However, UTRD has not been explored at the Phrasal level with respect to either corpora or approaches. To fill this research gap, this study contributes a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: (1) Derived = 15,105 and (2) Non-Derived = 9,896. In addition, we have developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed Sentence Transformer-based approaches on the proposed UTRD-Phr-23 Corpus.
As another contribution, we propose a novel Sentence Transformer-based model that combines eight different Sentence Transformers (ST): paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens. Our proposed model achieves an F1 score of 0.63, outperforming the best result obtained using the N-gram Overlap (baseline) approach (F1 = 0.53).
• Proposed a gold standard benchmark corpus for UTRD at the Phrasal level.
• Developed classical N-gram and Word Embedding approaches as baselines.
• Developed and proposed Transfer Learning approaches.
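To make the N-gram Overlap baseline named in the summary more concrete, the following is a minimal sketch of how such a score might be computed for one text pair. The summary does not say which overlap measure, tokenizer, or decision threshold the authors used, so the overlap coefficient, whitespace tokenization, and the helper names `word_ngrams`, `ngram_overlap`, and `is_derived` (with its default threshold of 0.5) are all assumptions made purely for illustration.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(source, candidate, n=1):
    """Overlap coefficient between the word n-gram sets of two texts.

    Assumption: whitespace tokenization and the overlap coefficient
    (shared n-grams divided by the size of the smaller set). The paper
    may use a different measure or tokenizer.
    """
    src = word_ngrams(source.split(), n)
    cand = word_ngrams(candidate.split(), n)
    if not src or not cand:
        # A text shorter than n words has no n-grams to compare.
        return 0.0
    return len(src & cand) / min(len(src), len(cand))


def is_derived(score, threshold=0.5):
    """Hypothetical decision rule: label a pair Derived at or above a threshold."""
    return score >= threshold


if __name__ == "__main__":
    pair = ("a b c d", "a b c e")
    score = ngram_overlap(*pair, n=2)
    print(score, is_derived(score))
```

In practice the threshold would be tuned on annotated pairs, or the overlap scores for several values of `n` would be passed as features to a classifier; neither choice is specified in the summary.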
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2023.121063