Parallel similarity joins on massive high-dimensional data using MapReduce
Published in: Concurrency and Computation: Practice and Experience, 2016-01, Vol. 28 (1), pp. 166-183
Main Authors: , ,
Format: Article
Language: English
Subjects:
Summary:
In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of the data and the number of dimensions increase, the computation cost of HDSJ grows exponentially, and no existing approach can process HDSJ efficiently at this scale. We therefore propose a novel method called symbolic aggregate approximation (SAX)-based HDSJ. SAX is a dimensionality-reduction technique widely used in time-series processing; in this paper, we use SAX to represent the high-dimensional vectors and reorganize them into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement both SAX-based HDSJ and improved SAX-based HDSJ on Hadoop-0.20.2, perform comprehensive experiments to evaluate their performance, and compare the two approaches with the existing method. The experimental results show that our proposed approaches perform substantially better than the existing method. Copyright © 2015 John Wiley & Sons, Ltd.
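To make the grouping idea in the abstract concrete, the following is a minimal, standalone Python sketch of SAX encoding (z-normalization, piecewise aggregate approximation, then symbolization against Gaussian breakpoints) and of bucketing vectors by their SAX word, as one might do when forming candidate groups for a similarity join. The segment count, alphabet size, and function names are illustrative assumptions; this is not the authors' implementation nor the MapReduce code described in the paper.

```python
import numpy as np
from collections import defaultdict

def sax_word(vector, n_segments=8, alphabet="abcd"):
    """Map a high-dimensional vector to a short symbolic word (SAX).

    Steps: z-normalize, reduce with piecewise aggregate approximation (PAA),
    then assign each PAA segment a symbol via breakpoints of N(0, 1).
    """
    v = np.asarray(vector, dtype=float)
    v = (v - v.mean()) / (v.std() + 1e-12)              # z-normalization
    segments = np.array_split(v, n_segments)
    paa = np.array([seg.mean() for seg in segments])    # PAA reduction
    breakpoints = np.array([-0.6745, 0.0, 0.6745])      # quartiles for a 4-symbol alphabet
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[s] for s in symbols)

def group_by_sax(vectors, **kwargs):
    """Bucket vector indices by SAX word; analogous to a map/shuffle key in an HDSJ job."""
    groups = defaultdict(list)
    for i, vec in enumerate(vectors):
        groups[sax_word(vec, **kwargs)].append(i)
    return groups

# Example: group 1,000 random 128-dimensional vectors
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 128))
buckets = group_by_sax(data)
print(len(buckets), "SAX groups")
```

In a MapReduce setting, the SAX word would typically serve as the intermediate key, so that only vectors sharing (or neighboring) a symbolic representation are compared in the reduce phase; the exact grouping and pruning rules used by the paper are described in the full text.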
ISSN: 1532-0626, 1532-0634
DOI: 10.1002/cpe.3663