Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning

Summary: In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute- and data-intensive workflows in areas as distinct as biology and astronomy. Although Spark is an easy-to-install framework, it has more than one hundred parameters to be set, in addition to the domain-specific parameters of each workflow. Thus, to execute Spark-based workflows efficiently, the user has to fine-tune a myriad of Spark and workflow parameters (e.g., the partitioning strategy, the average size of a DNA sequence, etc.). This configuration task cannot be performed manually in a trial-and-error manner, since that would be tedious and error-prone. This article proposes an approach that generates interpretable predictive machine learning models (i.e., decision trees) and then extracts useful rules (i.e., patterns) from them, which nonexpert users can apply to configure the parameters of future executions of the workflow and of Spark. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach reduced the number of parameters to be configured by identifying, in the predictive model, the domain-specific ones most relevant to workflow performance.

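The record does not include the article's code, but the abstract describes the core technique clearly enough to sketch: train an interpretable model (a decision tree) on logs of past workflow executions, then read configuration rules off its branches. The sketch below is a minimal illustration of that idea, not the authors' implementation; the CSV file, column names, and parameter choices are assumptions made for the example.

```python
# Minimal sketch of the approach described in the abstract: learn an
# interpretable decision tree from (hypothetical) logs of past Spark
# workflow runs, then print its branches as human-readable tuning rules.
# The file name and column names are assumptions, not from the article.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

runs = pd.read_csv("past_executions.csv")  # one row per past execution
features = ["shuffle_partitions", "executor_memory_gb", "avg_dna_sequence_size"]
X, y = runs[features], runs["runtime_s"]   # target: measured runtime

# A shallow tree stays interpretable: each root-to-leaf path reads as a
# rule such as "if shuffle_partitions <= 48, expected runtime ~ 310 s".
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))
```

A rule extracted this way maps directly onto Spark's standard configuration interface; for instance, a branch favoring more shuffle partitions could be applied to the next run as follows (the numeric values are example settings only):

```python
from pyspark.sql import SparkSession

# Apply tuned values before launching the workflow.
spark = (SparkSession.builder
         .appName("dna-workflow")
         .config("spark.sql.shuffle.partitions", "96")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```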

Bibliographic Details
Published in: Concurrency and Computation 2021-03, Vol. 33 (5), p. n/a
Main Authors: Oliveira, Douglas, Porto, Fábio, Boeres, Cristina, Oliveira, Daniel
Format: Article
Language:English
Subjects: Apache Spark; Astronomy; Configurations; Data systems; Decision trees; Domains; Machine learning; Parameter identification; Performance prediction; Prediction models; Scientific workflows; Spark parameter tuning; Workflow
DOI: 10.1002/cpe.5972
ISSN: 1532-0626
EISSN: 1532-0634
Publisher: Wiley Subscription Services, Inc. (Hoboken)