
Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI


Bibliographic Details
Main Authors: Al-Attar, Kinan; Shafi, Aamir; Abduljabbar, Mustafa; Subramoni, Hari; Panda, Dhabaleswar K.
Format: Conference Proceeding
Language: English
Subjects: Apache Spark; Benchmark testing; Big Data; Cluster computing; Machine learning; Message passing; MPI; Netty; Semantics; Software
cited_by
cites
container_end_page 81
container_issue
container_start_page 71
container_title
container_volume
creator Al-Attar, Kinan
Shafi, Aamir
Abduljabbar, Mustafa
Subramoni, Hari
Panda, Dhabaleswar K.
description There are several popular Big Data processing frameworks, including Apache Spark, Dask, and Ray. Apache Spark provides an easy-to-use, high-level API in several languages, including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads, handling communication through an event-driven framework called Netty. Past efforts, including RDMA-Spark and SparkUCX, have optimized Apache Spark for High-Performance Computing (HPC) systems equipped with high-performance interconnects such as InfiniBand. In the HPC community, Message Passing Interface (MPI) libraries are widely adopted for parallelizing science and engineering applications. This paper presents MPI4Spark, which uses MPI for communication in a parallel and distributed setting on HPC systems. MPI4Spark launches the Spark ecosystem with MPI launchers so that MPI communication can be used inside the Big Data framework. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM), and it bridges the semantic differences between Spark's event-driven communication and MPI's application-driven communication engine. MPI4Spark also provides portability and performance benefits, as it can utilize popular HPC interconnects including InfiniBand, Omni-Path, and Slingshot. The performance of MPI4Spark is evaluated against RDMA-Spark and Vanilla Spark using the OSU HiBD Benchmarks (OHB) and Intel HiBench, which contain a variety of Resilient Distributed Dataset (RDD), graph processing, and machine learning workloads. The evaluation is carried out on three HPC systems: TACC Frontera, TACC Stampede2, and an internal cluster. MPI4Spark outperforms Vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 Spark workers) for the GroupByTest benchmark in OHB. The communication performance of MPI4Spark is 13.08x and 5.56x better than Vanilla Spark and RDMA-Spark, respectively.
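The description notes that MPI4Spark isolates application execution on worker nodes by forking new processes through MPI Dynamic Process Management (DPM). As a minimal, hedged sketch of that general mechanism (not the MPI4Spark implementation, which is not shown in this record), the C snippet below spawns a group of worker processes with MPI_Comm_spawn and communicates with them over the returned inter-communicator; the "spark_worker" executable name and the worker count are illustrative assumptions.

```c
/*
 * Minimal, hypothetical sketch of MPI Dynamic Process Management (DPM).
 * Not taken from MPI4Spark; the "spark_worker" executable name and the
 * worker count of 8 are illustrative assumptions only.
 */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm workers;      /* inter-communicator to the spawned worker group */
    int errcodes[8];
    int job_id = 42;       /* arbitrary value broadcast to the workers */

    MPI_Init(&argc, &argv);

    /* Fork 8 new processes running a (hypothetical) worker executable.
     * The spawned group lives in its own MPI world, which is how DPM can
     * keep worker-side execution isolated from the launching process. */
    MPI_Comm_spawn("spark_worker", MPI_ARGV_NULL, 8, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, errcodes);

    /* On an inter-communicator broadcast, the parent passes MPI_ROOT. */
    MPI_Bcast(&job_id, 1, MPI_INT, MPI_ROOT, workers);

    MPI_Comm_disconnect(&workers);
    MPI_Finalize();
    return 0;
}
```

The inter-communicator returned by MPI_Comm_spawn is what allows a launcher to keep exchanging messages with dynamically started workers while they run as a separate process group, which is consistent with the isolation behaviour the description attributes to DPM.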
doi_str_mv 10.1109/CLUSTER51413.2022.00022
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2168-9253
EISBN: 1665498560
EISBN: 9781665498562
CODEN: IEEPAD
ispartof 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, p.71-81
issn 2168-9253
language eng
recordid cdi_ieee_primary_9912687
source IEEE Xplore All Conference Series
subjects Apache Spark
Benchmark testing
Big Data
Cluster computing
Machine learning
Message passing
MPI
Netty
Semantics
Software
title Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T11%3A59%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Spark%20Meets%20MPI:%20Towards%20High-Performance%20Communication%20Framework%20for%20Spark%20using%20MPI&rft.btitle=2022%20IEEE%20International%20Conference%20on%20Cluster%20Computing%20(CLUSTER)&rft.au=Al-Attar,%20Kinan&rft.date=2022-09&rft.spage=71&rft.epage=81&rft.pages=71-81&rft.eissn=2168-9253&rft.coden=IEEPAD&rft_id=info:doi/10.1109/CLUSTER51413.2022.00022&rft.eisbn=1665498560&rft.eisbn_list=9781665498562&rft_dat=%3Cieee_CHZPO%3E9912687%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i203t-c55f1daf4e80deb29160a3d9e060bfacf4ab332ee72ff08838b1201e255c950d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9912687&rfr_iscdi=true