
Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI


Bibliographic Details
Main Authors: Al-Attar, Kinan; Shafi, Aamir; Abduljabbar, Mustafa; Subramoni, Hari; Panda, Dhabaleswar K.
Format: Conference Proceeding
Language: English
Subjects: Apache Spark; Benchmark testing; Big Data; Cluster computing; Machine learning; Message passing; MPI; Netty; Semantics; Software
cited_by
cites
container_end_page 81
container_issue
container_start_page 71
container_title
container_volume
creator Al-Attar, Kinan
Shafi, Aamir
Abduljabbar, Mustafa
Subramoni, Hari
Panda, Dhabaleswar K.
description There are several popular Big Data processing frameworks, including Apache Spark, Dask, and Ray. Apache Spark provides an easy-to-use, high-level API in several languages, including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads, handling communication through an event-driven framework called Netty. Past efforts, including RDMA-Spark and SparkUCX, have optimized Apache Spark for High-Performance Computing (HPC) systems equipped with high-performance interconnects such as InfiniBand. In the HPC community, Message Passing Interface (MPI) libraries are widely adopted for parallelizing science and engineering applications. This paper presents MPI4Spark, which uses MPI for communication in a parallel and distributed setting on HPC systems. MPI4Spark launches the Spark ecosystem with MPI launchers so that MPI communication can be used inside the Big Data framework. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM), and it bridges the semantic differences between Spark's event-driven communication and MPI's application-driven communication engine. MPI4Spark also provides portability and performance benefits, as it can utilize popular HPC interconnects including InfiniBand, Omni-Path, and Slingshot. The performance of MPI4Spark is evaluated against RDMA-Spark and Vanilla Spark using the OSU HiBD Benchmarks (OHB) and Intel HiBench, which contain a variety of Resilient Distributed Dataset (RDD), graph processing, and machine learning workloads. The evaluation is carried out on three HPC systems: TACC Frontera, TACC Stampede2, and an internal cluster. MPI4Spark outperforms Vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 Spark workers) for the GroupByTest benchmark in OHB. The communication performance of MPI4Spark is 13.08x and 5.56x better than Vanilla Spark and RDMA-Spark, respectively.
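The description notes that MPI4Spark isolates application execution on worker nodes by forking new processes through MPI Dynamic Process Management (DPM). As a minimal, hedged sketch of that general mechanism (not the MPI4Spark implementation, which is not shown in this record), the C snippet below spawns a group of worker processes with MPI_Comm_spawn and communicates with them over the returned inter-communicator; the "spark_worker" executable name and the worker count are illustrative assumptions.

```c
/*
 * Minimal, hypothetical sketch of MPI Dynamic Process Management (DPM).
 * Not taken from MPI4Spark; the "spark_worker" executable name and the
 * worker count of 8 are illustrative assumptions only.
 */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm workers;      /* inter-communicator to the spawned worker group */
    int errcodes[8];
    int job_id = 42;       /* arbitrary value broadcast to the workers */

    MPI_Init(&argc, &argv);

    /* Fork 8 new processes running a (hypothetical) worker executable.
     * The spawned group lives in its own MPI world, which is how DPM can
     * keep worker-side execution isolated from the launching process. */
    MPI_Comm_spawn("spark_worker", MPI_ARGV_NULL, 8, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, errcodes);

    /* On an inter-communicator broadcast, the parent passes MPI_ROOT. */
    MPI_Bcast(&job_id, 1, MPI_INT, MPI_ROOT, workers);

    MPI_Comm_disconnect(&workers);
    MPI_Finalize();
    return 0;
}
```

The inter-communicator returned by MPI_Comm_spawn is what allows a launcher to keep exchanging messages with dynamically started workers while they run as a separate process group, which is consistent with the isolation behaviour the description attributes to DPM.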
doi_str_mv 10.1109/CLUSTER51413.2022.00022
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2168-9253
EISBN: 1665498560
EISBN: 9781665498562
CODEN: IEEPAD
ispartof 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, p.71-81
issn 2168-9253
language eng
recordid cdi_ieee_primary_9912687
source IEEE Xplore All Conference Series
subjects Apache Spark
Benchmark testing
Big Data
Cluster computing
Machine learning
Message passing
MPI
Netty
Semantics
Software
title Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T11%3A59%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Spark%20Meets%20MPI:%20Towards%20High-Performance%20Communication%20Framework%20for%20Spark%20using%20MPI&rft.btitle=2022%20IEEE%20International%20Conference%20on%20Cluster%20Computing%20(CLUSTER)&rft.au=Al-Attar,%20Kinan&rft.date=2022-09&rft.spage=71&rft.epage=81&rft.pages=71-81&rft.eissn=2168-9253&rft.coden=IEEPAD&rft_id=info:doi/10.1109/CLUSTER51413.2022.00022&rft.eisbn=1665498560&rft.eisbn_list=9781665498562&rft_dat=%3Cieee_CHZPO%3E9912687%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i203t-c55f1daf4e80deb29160a3d9e060bfacf4ab332ee72ff08838b1201e255c950d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9912687&rfr_iscdi=true