
Enabling scalable and adaptive machine learning training via serverless computing on public cloud

In today’s production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of various tasks that have dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms poses non-trivial challenges due to their intrinsic design limitations, such as their stateless nature, limited communication support across function instances, and limited function execution duration. These limitations result in a lack of an overarching view and adaptation mechanism for training dynamics, and an amplification of existing problems in ML workflows. To address these challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework on public cloud to enable efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting a user-specified training deadline and budget limit. In addition, by providing an end-to-end design, SMLT solves the intrinsic problems of public cloud serverless platforms, such as communication overhead, limited function execution duration, and the need for repeated initialization, and also provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms state-of-the-art VM-based systems and existing public cloud serverless ML training frameworks in both training speed (up to 8×) and monetary cost (up to 3×).
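The abstract mentions that SMLT drives training from a user-specified deadline and budget limit. The sketch below is a hypothetical illustration only (it is not SMLT's actual API or the Python names used by the authors): it shows the kind of user-facing job specification and scaling decision such a deadline- and budget-aware scheduler might make, under a simplifying perfect-scaling assumption.

    # Hypothetical sketch, not SMLT's real interface. All names and the
    # cost/throughput inputs below are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class TrainingJob:
        model: str              # model identifier from the user's ML framework
        deadline_hours: float   # user-specified completion deadline
        budget_usd: float       # user-specified monetary budget limit

    def plan_workers(job: TrainingJob,
                     est_total_worker_hours: float,
                     price_per_worker_hour: float) -> int:
        """Choose a worker count that meets the deadline without exceeding the budget.

        est_total_worker_hours and price_per_worker_hour are assumed inputs that a
        real scheduler would estimate from profiling; here they are plain parameters.
        """
        # Workers needed to finish by the deadline (assumes linear scaling).
        needed_for_deadline = -(-est_total_worker_hours // job.deadline_hours)  # ceiling
        # Workers the budget can sustain over the deadline window.
        affordable = job.budget_usd / (price_per_worker_hour * job.deadline_hours)
        return int(max(1, min(needed_for_deadline, affordable)))

    if __name__ == "__main__":
        job = TrainingJob(model="resnet50", deadline_hours=4.0, budget_usd=20.0)
        print(plan_workers(job, est_total_worker_hours=12.0, price_per_worker_hour=1.2))  # -> 3

A real system would additionally re-estimate throughput and cost during training and rescale workers adaptively, which is the behavior the abstract attributes to SMLT's scheduling mechanism.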

Bibliographic Details
Published in: Performance Evaluation, 2025-03, Vol. 167, p. 102451, Article 102451
Main Authors: Ali, Ahsan; Ma, Xiaolong; Zawad, Syed; Aditya, Paarijaat; Akkus, Istemi Ekin; Chen, Ruichuan; Yang, Lei; Yan, Feng
Format: Article
Language: English
Subjects: Machine learning; Resource management; Serverless computing
ISSN: 0166-5316
DOI: 10.1016/j.peva.2024.102451
Publisher: Elsevier B.V
Online Access: Get full text