HySet: A hybrid framework for exact set similarity join using a GPU

Bibliographic Details
Published in: Parallel Computing, 2021-07, Vol. 104-105, p. 102790, Article 102790
Main Authors: Bellas, Christos; Gounaris, Anastasios
Format: Article
Language: English
Description
Summary: Set similarity join is a fundamental operation used in a wide range of applications such as data mining, data cleaning and entity resolution. Existing methods for set similarity join conform to a filter-verification framework, in which candidate pairs are generated in a filtering phase and then undergo a verification phase that outputs the final result. Several kinds of filtering techniques have been proposed, and techniques also differ in how they couple filtering with verification. However, it has been shown that no globally dominant technique exists: depending on the dataset and query characteristics, each technique has its own strong and weak points. Based on these findings, the main contribution of this work is a hybrid framework for the set similarity join operation on a single GPU-equipped machine. The framework encapsulates a partitioning mechanism that makes appropriate use of both the CPU and the GPU. We present all technical details and, after thorough evaluation, show performance speedups of up to 3.25x.
• Decrease runtime by achieving execution overlap between CPU and GPU.
• Heterogeneity in existing filters hinders the possibility for even larger speedups.
• Hybrid techniques can be scaled on a multi-GPU environment.
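The filter-verification framework mentioned in the summary can be illustrated with a minimal CPU-only sketch. This is not the HySet implementation: the record does not say which filters or similarity functions the paper uses, so the choice of Jaccard similarity, the classic prefix filter, and the names `jaccard` and `similarity_self_join` are all assumptions made for illustration. The sketch assumes a threshold in (0, 1] and tokens that can be compared with each other (for example, all integers).

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_self_join(sets, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative only: prefix filtering plus exact verification,
    assuming 0 < threshold <= 1.
    """
    # Global token order: rarest tokens first, so prefixes are selective.
    freq = Counter(tok for s in sets for tok in s)
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets holding it in their prefix
    result = []
    for i, tokens in enumerate(ordered):
        n = len(tokens)
        if n == 0:
            continue
        # Prefix filter: two sets with Jaccard >= threshold must share
        # at least one token within their prefixes of this length.
        # The small epsilon guards against floating-point round-up.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        prefix = tokens[:prefix_len]

        # Filtering phase: gather candidates from the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: compute the exact similarity per candidate.
        for j in sorted(candidates):
            if jaccard(sets[i], sets[j]) >= threshold:
                result.append((j, i))

        # Index the current set only after probing, so each pair is seen once.
        for tok in prefix:
            index[tok].append(i)
    return sorted(result)
```

For example, `similarity_self_join([{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3}, {7, 8, 9}], 0.6)` returns `[(0, 1), (0, 2)]`. In a hybrid setting of the kind the summary describes, the filtering and verification phases are natural units to divide between CPU and GPU, but how HySet partitions the work is not stated in this record.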
ISSN: 0167-8191, 1872-7336
DOI: 10.1016/j.parco.2021.102790