Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning
Summary In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute‐ and data‐intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy‐to‐install frame...
Published in: | Concurrency and computation 2021-03, Vol.33 (5), p.n/a |
---|---|
Main Authors: | Oliveira, Douglas; Porto, Fábio; Boeres, Cristina; Oliveira, Daniel |
Format: | Article |
Language: | English |
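Subjects: | Apache spark; Astronomy; Configurations; Data systems; Decision trees; Domains; Machine learning; Parameter identification; Performance prediction; Prediction models; scientific workflows; Spark parameter tuning; Workflow |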
cited_by | cdi_FETCH-LOGICAL-c3462-78082a342a416d7f98df7159c6ae793b5ca252870bba337bf68c83b84a9986dc3 |
---|---|
cites | cdi_FETCH-LOGICAL-c3462-78082a342a416d7f98df7159c6ae793b5ca252870bba337bf68c83b84a9986dc3 |
container_end_page | n/a |
container_issue | 5 |
container_start_page | |
container_title | Concurrency and computation |
container_volume | 33 |
creator | Oliveira, Douglas; Porto, Fábio; Boeres, Cristina; Oliveira, Daniel |
description | Summary
In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute‐ and data‐intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy‐to‐install framework, it has more than one hundred parameters to be set, besides the domain‐specific parameters of each workflow. Thus, to execute Spark‐based workflows efficiently, the user has to fine‐tune a myriad of Spark and workflow parameters (eg, partitioning strategy, the average size of a DNA sequence). This configuration task cannot be performed manually in a trial‐and‐error manner, since doing so is tedious and error‐prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (ie, decision trees) and then extracting useful rules (ie, patterns) from these models that can be applied to configure the parameters of future executions of the workflow and Spark for nonexpert users. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach introduced here reduced the number of parameters to be configured by identifying, in the predictive model, the domain‐specific parameters most relevant to workflow performance. |
doi_str_mv | 10.1002/cpe.5972 |
format | article |
publisher | Hoboken: Wiley Subscription Services, Inc |
date | 2021-03-10 |
rights | 2020 John Wiley & Sons Ltd; 2021 John Wiley & Sons, Ltd. |
toplevel | peer_reviewed |
tpages | 35 |
fulltext | fulltext |
identifier | ISSN: 1532-0626 |
ispartof | Concurrency and computation, 2021-03, Vol.33 (5), p.n/a |
issn | 1532-0626; 1532-0634 |
language | eng |
recordid | cdi_proquest_journals_2488767454 |
source | Wiley |
subjects | Apache spark; Astronomy; Configurations; Data systems; Decision trees; Domains; Machine learning; Parameter identification; Performance prediction; Prediction models; scientific workflows; Spark parameter tuning; Workflow |
title | Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T18%3A41%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Towards%20optimizing%20the%20execution%20of%20spark%20scientific%20workflows%20using%20machine%20learning%E2%80%90based%20parameter%20tuning&rft.jtitle=Concurrency%20and%20computation&rft.au=Oliveira,%20Douglas&rft.date=2021-03-10&rft.volume=33&rft.issue=5&rft.epage=n/a&rft.issn=1532-0626&rft.eissn=1532-0634&rft_id=info:doi/10.1002/cpe.5972&rft_dat=%3Cproquest_cross%3E2488767454%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c3462-78082a342a416d7f98df7159c6ae793b5ca252870bba337bf68c83b84a9986dc3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2488767454&rft_id=info:pmid/&rfr_iscdi=true |