LocationSpark: a distributed in-memory data management system for big spatial data

Bibliographic Details
Published in: Proceedings of the VLDB Endowment, 2016-09, Vol. 9 (13), p. 1565-1568
Main Authors: Tang, Mingjie; Yu, Yongyang; Malluhi, Qutaibah M.; Ouzzani, Mourad; Aref, Walid G.
Format: Article
Language: English
Description
Summary: We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operations, spatial join, and kNN join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order of magnitude performance gain over a baseline framework.
ISSN: 2150-8097
DOI: 10.14778/3007263.3007310