Characterization and prediction of deep learning workloads in large-scale GPU datacenters
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors.
Main Authors: Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei
Format: Conference Proceeding
Language: English
Online Access: Request full text
container_start_page | 1 |
container_end_page | 15 |
creator | Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei |
description | Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%. |
doi_str_mv | 10.1145/3458817.3476223 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISBN: 9781450384421 |
ispartof | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, p.1-15 |
issn | 2167-4337 |
language | eng |
recordid | cdi_ieee_primary_9910054 |
source | IEEE Xplore All Conference Series |
subjects | Behavioral sciences; Cluster Management System; Cluster Statistical Analysis; Computing methodologies -- Distributed computing methodologies; Computing methodologies -- Machine learning -- Machine learning approaches; Computing methodologies -- Modeling and simulation -- Simulation evaluation; Deep learning; Deep Learning Training; Energy Conservation; GPU Datacenter; Graphics processing units; High performance computing; Industries; Job shop scheduling; Power demand; Time-series Prediction; Workload Scheduling |
title | Characterization and prediction of deep learning workloads in large-scale GPU datacenters |
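The abstract describes a Quasi-Shortest-Service-First (QSSF) scheduling service that uses historical traces to predict job service time and prioritize short jobs, reducing cluster-wide average job completion time. The snippet below is a minimal sketch of that general idea, not the paper's implementation; the `Job` class, the `qssf_order` helper, and the choice of `num_gpus * predicted_duration_s` as the service-time estimate are illustrative assumptions.

```python
# Hypothetical sketch of a Quasi-Shortest-Service-First (QSSF) ordering.
# Names and the service-time estimate are illustrative, not from the paper.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Jobs compare by this single scalar: predicted GPU service time.
    priority: float = field(init=False)
    name: str = field(compare=False)
    num_gpus: int = field(compare=False)
    predicted_duration_s: float = field(compare=False)

    def __post_init__(self) -> None:
        # Assumed estimate: GPUs requested times predicted runtime,
        # e.g. produced by a history-based duration predictor.
        self.priority = self.num_gpus * self.predicted_duration_s


def qssf_order(pending: list[Job]) -> list[Job]:
    """Return pending jobs in the order a QSSF queue would dequeue them."""
    heap = list(pending)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    jobs = [
        Job("bert-pretrain", num_gpus=32, predicted_duration_s=86_400),
        Job("hpo-trial", num_gpus=1, predicted_duration_s=600),
        Job("resnet50-train", num_gpus=8, predicted_duration_s=3_600),
    ]
    for job in qssf_order(jobs):
        print(f"{job.name}: {job.priority:,.0f} predicted GPU-seconds")
```

Ranking pending jobs by a single predicted-service-time scalar keeps each scheduling decision cheap (a heap pop), which is one reason shortest-service-first style policies are attractive when historical traces make job durations predictable.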