Loading…

Messing up with BART: error generation for evaluating data-cleaning algorithms

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. W...

Full description

Saved in:

Bibliographic Details
Published in:	Proceedings of the VLDB Endowment 2015-10, Vol.9 (2), p.36-47
Main Authors:	Arocena, Patricia C., Glavic, Boris, Mecca, Giansalvatore, Miller, Renée J., Papotti, Paolo, Santoro, Donatello
Format:	Article
Language:	English
Citations:	Items that this one cites
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
ISSN:	2150-8097 2150-8097
DOI:	10.14778/2850578.2850579