Efficient spatial data partitioning for distributed $$k$$NN joins

Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability,...

Bibliographic Details
Published in:	Journal of big data 2022-12, Vol.9 (1), Article 77
Main Authors:	Zeidan, Ayman; Vo, Huy T.
Format:	Article
Language:	English

cited_by	cdi_FETCH-LOGICAL-c2062-db7c918601ae7ef21450c6e1bad55b5a31a72a9652dbe86784cff7203c2cd71e3
cites	cdi_FETCH-LOGICAL-c2062-db7c918601ae7ef21450c6e1bad55b5a31a72a9652dbe86784cff7203c2cd71e3
container_end_page
container_issue	1
container_start_page
container_title	Journal of big data
container_volume	9
creator	Zeidan, Ayman ; Vo, Huy T.
description	Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known $$k$$-Nearest Neighbor ($$k$$NN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This work contributes to the problem of spatial data partitioning through (1) a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate $$k$$NN spatial join query. Experimental tests show up to a 1.48-times improvement in runtime as well as improved accuracy of results.
doi_str_mv	10.1186/s40537-022-00587-2
format	article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1186_s40537_022_00587_2</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><recordtype>article</recordtype></control><display><type>article</type><title>Efficient spatial data partitioning for distributed $$k$$NN joins</title><creator>Zeidan, Ayman ; Vo, Huy T.</creator><identifier>ISSN: 2196-1115</identifier><identifier>EISSN: 2196-1115</identifier><identifier>DOI: 10.1186/s40537-022-00587-2</identifier><language>eng</language><ispartof>Journal of big data, 2022-12, Vol.9 (1), Article 77</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><orcidid>0000-0002-2881-5047</orcidid></display></record>
fulltext	fulltext
identifier	ISSN: 2196-1115
ispartof	Journal of big data, 2022-12, Vol.9 (1), Article 77
issn	2196-1115 2196-1115
language	eng
recordid	cdi_crossref_primary_10_1186_s40537_022_00587_2
source	ABI/INFORM global; Social Science Premium Collection; Springer Nature - SpringerLink Journals - Fully Open Access ; Publicly Available Content (ProQuest)
title	Efficient spatial data partitioning for distributed $$k$$NN joins
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T01%3A49%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Efficient%20spatial%20data%20partitioning%20for%20distributed%20$$k$$NN%20joins&rft.jtitle=Journal%20of%20big%20data&rft.au=Zeidan,%20Ayman&rft.date=2022-12-01&rft.volume=9&rft.issue=1&rft.artnum=77&rft.issn=2196-1115&rft.eissn=2196-1115&rft_id=info:doi/10.1186/s40537-022-00587-2&rft_dat=%3Ccrossref%3E10_1186_s40537_022_00587_2%3C/crossref%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c2062-db7c918601ae7ef21450c6e1bad55b5a31a72a9652dbe86784cff7203c2cd71e3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true