Loading…

Distributed implementations of dependency discovery algorithms

We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothi...

Full description

Saved in:
Bibliographic Details
Published in:Proceedings of the VLDB Endowment 2019-07, Vol.12 (11), p.1624-1636
Main Authors: Saxena, Hemant, Golab, Lukasz, Ilyas, Ihab F.
Format: Article
Language:English
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.
ISSN:2150-8097
2150-8097
DOI:10.14778/3342263.3342638