
Parallel similarity joins on massive high-dimensional data using MapReduce

Bibliographic Details
Published in: Concurrency and Computation, 2016-01, Vol. 28 (1), pp. 166-183
Main Authors: Ma, Youzhong; Meng, Xiaofeng; Wang, Shaoya
Format: Article
Language: English
Summary: In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of the data and the number of dimensions increase, the computation cost of HDSJ grows exponentially, and no existing approach can process HDSJ efficiently at this scale. We therefore propose a novel method called symbolic aggregate approximation (SAX)-based HDSJ. SAX is a dimensionality reduction technique widely used in time series processing; in this paper we use SAX to represent the high-dimensional vectors and reorganize these vectors into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement both SAX-based HDSJ and improved SAX-based HDSJ on Hadoop-0.20.2 and perform comprehensive experiments to evaluate their performance, comparing them with the existing method. The experimental results show that our proposed approaches perform much better than the existing method. Copyright © 2015 John Wiley & Sons, Ltd.
ISSN: 1532-0626; 1532-0634
DOI: 10.1002/cpe.3663
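
The abstract describes representing each high-dimensional vector by a short SAX word and grouping vectors by that word so that join candidates are generated only within (or near) a group. The following is a minimal standalone Python sketch of that grouping idea, not the authors' MapReduce implementation; the segment count, alphabet size, and function names are illustrative assumptions.

    # Sketch of SAX-based grouping, assuming standard SAX:
    # z-normalize, PAA-average into segments, then discretize with
    # Gaussian breakpoints. Parameters here are illustrative only.
    import numpy as np
    from collections import defaultdict
    from scipy.stats import norm

    def sax_word(vector, segments=8, alphabet_size=4):
        """Map a high-dimensional vector to a short SAX word."""
        x = np.asarray(vector, dtype=float)
        std = x.std()
        x = (x - x.mean()) / std if std > 0 else x - x.mean()
        # PAA: average consecutive chunks of the vector
        paa = np.array([chunk.mean() for chunk in np.array_split(x, segments)])
        # breakpoints cutting N(0,1) into equiprobable regions
        breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
        symbols = np.searchsorted(breakpoints, paa)
        return "".join(chr(ord("a") + int(s)) for s in symbols)

    def group_by_sax(vectors, segments=8, alphabet_size=4):
        """Bucket vector indices by SAX word; candidate join pairs are
        then generated per bucket, mimicking the map-side partitioning
        step of a MapReduce similarity join."""
        groups = defaultdict(list)
        for i, v in enumerate(vectors):
            groups[sax_word(v, segments, alphabet_size)].append(i)
        return groups

In a MapReduce setting, the mapper would emit (SAX word, vector) pairs and the reducer would compute the pairwise similarity join within each group, which is the general partitioning pattern the paper builds on.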