Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Bibliographic Details
Published in: Computer Speech & Language, 2022-09, Vol. 75, p. 101381, Article 101381
Main Authors: Muneer, Iqra; Nawab, Rao Muhammad Adeel
Format: Article
Language: English
Description
Summary: In recent years, the problem of Cross-Lingual Text Reuse Detection (X-TRD) has gained the interest of researchers due to the availability of large digital repositories and automatic translation systems. These systems are readily available and openly accessible, which makes text reuse across languages easier to carry out and harder to detect. In previous studies, different corpora and techniques have been developed for X-TRD at the sentence/passage and document levels for the English–Urdu language pair. However, there is a lack of large benchmark corpora and standard techniques for X-TRD for the English–Urdu language pair at the sentence level. To overcome this limitation, this study presents a large benchmark sentential cross-lingual (English–Urdu) corpus of 21,669 sentence pairs with simulated cases of cross-lingual text reuse (X-TR), which are manually annotated at three levels of rewrite (Wholly Derived (WD) = 7,655, Partially Derived (PD) = 6,461, and Non Derived (ND) = 7,553). As a second major contribution, we have applied various state-of-the-art Cross-Lingual Sentence Transformers (CLST) and Translation plus Mono-lingual Analysis (T+MA) techniques, the latter including N-gram Overlap (lexical), WordNet-based techniques (semantic), mono-lingual word embedding-based techniques, and Kullback–Leibler Distance (KLD) (probabilistic), to our proposed sentential corpus for X-TRD. For the binary classification task, the best results (F1 = 0.94) are obtained using either a combination of all CLST and T+MA techniques or a combination of all T+MA techniques alone, whereas for the ternary classification task, the best results (F1 = 0.84) are obtained using a combination of all CLST and T+MA techniques. The corpus will be made publicly available to foster and promote research on X-TRD in an under-resourced language such as Urdu.
•Proposed a large benchmark corpus of 21,669 sentence pairs (English–Urdu language pair) for Cross-Lingual Text Reuse Detection.
•Developed and applied various Translation + Mono-lingual Analysis based classical machine learning techniques.
•Developed and applied various state-of-the-art Cross-Lingual Sentence Transformers based techniques.
•The best results are obtained using a combination of all CLST and T+MA techniques for both binary and ternary classification tasks.
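
To make two of the T+MA measures named in the summary more concrete, the following is a minimal sketch of a word N-gram overlap score and a smoothed Kullback–Leibler distance between two English sentences. It is not the authors' implementation. It assumes the Urdu sentence has already been translated into English by some external system (not shown), and the whitespace tokenization, the containment formulation of overlap, the additive smoothing constant `eps`, and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams using naive lowercase whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_containment(suspicious, source, n=1):
    """Share of the suspicious sentence's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    src = word_ngrams(source, n)
    total = sum(susp.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((susp & src).values())
    return shared / total


def kl_distance(p_text, q_text, eps=1e-6):
    """KL divergence between unigram distributions, with additive smoothing (eps is hypothetical)."""
    p_counts = word_ngrams(p_text, 1)
    q_counts = word_ngrams(q_text, 1)
    vocab = set(p_counts) | set(q_counts)
    if not vocab:
        return 0.0
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    distance = 0.0
    for gram in vocab:
        p = (p_counts[gram] + eps) / p_total
        q = (q_counts[gram] + eps) / q_total
        distance += p * math.log(p / q)
    return distance
```

In a setup like the one described, scores of this kind would typically serve as features for a classifier that separates the WD, PD, and ND classes; a higher containment and a lower KL distance both suggest closer reuse.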
ISSN: 0885-2308, 1095-8363
DOI: 10.1016/j.csl.2022.101381