Web-based Arabic/English duplicate record detection with nested blocking technique

Bibliographic Details
Main Authors: Higazy, Azza; Tobely, Tarek El; Yousef, Ahmed H.; Sarhan, Amany
Format: Conference Proceeding
Language: English
Description
Summary: Data accuracy and quality affect the success of any business intelligence or data mining solution. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset. This becomes more complicated when entities are identified by a string value, as is the case with person names. Such inaccuracies arise from misspellings and a wide range of typographical variations, especially in non-Latin languages such as Arabic. To the best of the authors' knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks do not support the Arabic language and are difficult to configure. In this paper, an English/Arabic-enabled web-based framework is designed and implemented that takes into account the wide range of variations in the Arabic language. Improved indexing/blocking techniques are used to allow fast processing. The framework is implemented and verified through several case studies. The results show that the framework offers substantial improvements over known techniques.
DOI: 10.1109/ICCES.2013.6707225