Loading…

Discovery of approximate (and exact) denial constraints

Maintaining data consistency is known to be hard. Recent approaches have relied on integrity constraints to deal with the problem - correct and complete constraints naturally work towards data consistency. State-of-the-art data cleaning frameworks have used the formalism known as denial constraint (...

Full description

Saved in:
Bibliographic Details
Published in:Proceedings of the VLDB Endowment 2019-11, Vol.13 (3), p.266-278
Main Authors: Pena, Eduardo H. M., de Almeida, Eduardo C., Naumann, Felix
Format: Article
Language:English
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Maintaining data consistency is known to be hard. Recent approaches have relied on integrity constraints to deal with the problem - correct and complete constraints naturally work towards data consistency. State-of-the-art data cleaning frameworks have used the formalism known as denial constraint (DC) to handle a wide range of real-world constraints. Each DC expresses a relationship between predicates that indicate which combinations of attribute values are inconsistent. The design of DCs, however, must keep pace with the complexity of data and applications. The alternative to designing DCs by hand is automatically discovering DCs from data, which is computationally expensive due to the large search space of DCs. To tackle this challenging task, we present a novel algorithm to efficiently discover DCs: DCFinder. The algorithm combines data structures called position list indexes with techniques based on predicate selectivity to efficiently validate DC candidates. Because the available data often contain errors, DCFinder is especially designed to discovering approximate DCs, i.e., DCs that may partially hold. Our experimental evaluation uses real and synthetic datasets and shows that DCFinder outperforms all the existing approximate DC discovery algorithms.
ISSN:2150-8097
2150-8097
DOI:10.14778/3368289.3368293