Loading…

Messing up with BART: error generation for evaluating data-cleaning algorithms

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. W...

Full description

Saved in:
Bibliographic Details
Published in:Proceedings of the VLDB Endowment 2015-10, Vol.9 (2), p.36-47
Main Authors: Arocena, Patricia C., Glavic, Boris, Mecca, Giansalvatore, Miller, Renée J., Papotti, Paolo, Santoro, Donatello
Format: Article
Language:English
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
ISSN:2150-8097
2150-8097
DOI:10.14778/2850578.2850579