
Characterization and prediction of deep learning workloads in large-scale GPU datacenters

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep underst...


Bibliographic Details
Main Authors: Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei
Format: Conference Proceeding
Language: English
container_end_page 15
container_start_page 1
creator Hu, Qinghao
Sun, Peng
Yan, Shengen
Wen, Yonggang
Zhang, Tianwei
description Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
doi_str_mv 10.1145/3458817.3476223
format conference_proceeding
fullrecord <record><control><sourceid>acm_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_9910054</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9910054</ieee_id><sourcerecordid>acm_books_10_1145_3458817_3476223</sourcerecordid><originalsourceid>FETCH-LOGICAL-a177t-9aa7443a09811d48fcf90be45357657d27fbc6adfd02f2ee1105389a9d7c6b913</originalsourceid><addsrcrecordid>eNqNkD1PwzAURQ0IiQo6M7B4ZEnxV2J7RBUUpEow0IHJerGfi2maVE4kBL-eQDsxMV1dnas7HEIuOZtxrsobqUpjuJ5JpSsh5BGZWm1GwKRRSvBjMhG80oWSUp_8YWdk2vfvjDFhNJeCTcjr_A0y-AFz-oIhdS2FNtBdxpD8b-0iDYg72iDkNrVr-tHlTdNB6GlqaQN5jUXvoUG6eF7RAAN4bMe7_oKcRmh6nB7ynKzu717mD8XyafE4v10WwLUeCguglZLArOE8KBN9tKxGVcpSV6UOQsfaVxBiYCIKRM5ZKY0FG7SvasvlObna_yZEdLuctpA_nbWcsVKN9HpPwW9d3XWb3nHmfjy6g0d38DhOZ_-cujonjPIbOoZuTw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Characterization and prediction of deep learning workloads in large-scale GPU datacenters</title><source>IEEE Xplore All Conference Series</source><creator>Hu, Qinghao ; Sun, Peng ; Yan, Shengen ; Wen, Yonggang ; Zhang, Tianwei</creator><creatorcontrib>Hu, Qinghao ; Sun, Peng ; Yan, Shengen ; Wen, Yonggang ; Zhang, Tianwei</creatorcontrib><description>Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. 
Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.</description><identifier>ISBN: 9781450384421</identifier><identifier>ISBN: 1450384420</identifier><identifier>EISSN: 2167-4337</identifier><identifier>EISBN: 9781450384421</identifier><identifier>EISBN: 1450384420</identifier><identifier>DOI: 10.1145/3458817.3476223</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Behavioral sciences ; Cluster Management System ; Cluster Statistical Analysis ; Computing methodologies -- Distributed computing methodologies ; Computing methodologies -- Machine learning -- Machine learning approaches ; Computing methodologies -- Modeling and simulation -- Simulation evaluation ; Deep learning ; Deep Learning Training ; Energy Conservation ; GPU Datacenter ; Graphics processing units ; High performance computing ; Industries ; Job shop scheduling ; Power demand ; Time-series Prediction ; Workload Scheduling</subject><ispartof>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, p.1-15</ispartof><rights>2021 ACM</rights><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9910054$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9910054$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Hu, 
Qinghao</creatorcontrib><creatorcontrib>Sun, Peng</creatorcontrib><creatorcontrib>Yan, Shengen</creatorcontrib><creatorcontrib>Wen, Yonggang</creatorcontrib><creatorcontrib>Zhang, Tianwei</creatorcontrib><title>Characterization and prediction of deep learning workloads in large-scale GPU datacenters</title><title>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</title><addtitle>SC</addtitle><description>Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. 
As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.</description><subject>Behavioral sciences</subject><subject>Cluster Management System</subject><subject>Cluster Statistical Analysis</subject><subject>Computing methodologies -- Distributed computing methodologies</subject><subject>Computing methodologies -- Machine learning -- Machine learning approaches</subject><subject>Computing methodologies -- Modeling and simulation -- Simulation evaluation</subject><subject>Deep learning</subject><subject>Deep Learning Training</subject><subject>Energy Conservation</subject><subject>GPU Datacenter</subject><subject>Graphics processing units</subject><subject>High performance computing</subject><subject>Industries</subject><subject>Job shop scheduling</subject><subject>Power demand</subject><subject>Time-series Prediction</subject><subject>Workload Scheduling</subject><issn>2167-4337</issn><isbn>9781450384421</isbn><isbn>1450384420</isbn><isbn>9781450384421</isbn><isbn>1450384420</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2021</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNqNkD1PwzAURQ0IiQo6M7B4ZEnxV2J7RBUUpEow0IHJerGfi2maVE4kBL-eQDsxMV1dnas7HEIuOZtxrsobqUpjuJ5JpSsh5BGZWm1GwKRRSvBjMhG80oWSUp_8YWdk2vfvjDFhNJeCTcjr_A0y-AFz-oIhdS2FNtBdxpD8b-0iDYg72iDkNrVr-tHlTdNB6GlqaQN5jUXvoUG6eF7RAAN4bMe7_oKcRmh6nB7ynKzu717mD8XyafE4v10WwLUeCguglZLArOE8KBN9tKxGVcpSV6UOQsfaVxBiYCIKRM5ZKY0FG7SvasvlObna_yZEdLuctpA_nbWcsVKN9HpPwW9d3XWb3nHmfjy6g0d38DhOZ_-cujonjPIbOoZuTw</recordid><startdate>20211114</startdate><enddate>20211114</enddate><creator>Hu, Qinghao</creator><creator>Sun, Peng</creator><creator>Yan, Shengen</creator><creator>Wen, Yonggang</creator><creator>Zhang, 
Tianwei</creator><general>ACM</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>20211114</creationdate><title>Characterization and prediction of deep learning workloads in large-scale GPU datacenters</title><author>Hu, Qinghao ; Sun, Peng ; Yan, Shengen ; Wen, Yonggang ; Zhang, Tianwei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a177t-9aa7443a09811d48fcf90be45357657d27fbc6adfd02f2ee1105389a9d7c6b913</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Behavioral sciences</topic><topic>Cluster Management System</topic><topic>Cluster Statistical Analysis</topic><topic>Computing methodologies -- Distributed computing methodologies</topic><topic>Computing methodologies -- Machine learning -- Machine learning approaches</topic><topic>Computing methodologies -- Modeling and simulation -- Simulation evaluation</topic><topic>Deep learning</topic><topic>Deep Learning Training</topic><topic>Energy Conservation</topic><topic>GPU Datacenter</topic><topic>Graphics processing units</topic><topic>High performance computing</topic><topic>Industries</topic><topic>Job shop scheduling</topic><topic>Power demand</topic><topic>Time-series Prediction</topic><topic>Workload Scheduling</topic><toplevel>online_resources</toplevel><creatorcontrib>Hu, Qinghao</creatorcontrib><creatorcontrib>Sun, Peng</creatorcontrib><creatorcontrib>Yan, Shengen</creatorcontrib><creatorcontrib>Wen, Yonggang</creatorcontrib><creatorcontrib>Zhang, Tianwei</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library Online</collection><collection>IEEE Proceedings 
Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Hu, Qinghao</au><au>Sun, Peng</au><au>Yan, Shengen</au><au>Wen, Yonggang</au><au>Zhang, Tianwei</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Characterization and prediction of deep learning workloads in large-scale GPU datacenters</atitle><btitle>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</btitle><stitle>SC</stitle><date>2021-11-14</date><risdate>2021</risdate><spage>1</spage><epage>15</epage><pages>1-15</pages><eissn>2167-4337</eissn><isbn>9781450384421</isbn><isbn>1450384420</isbn><eisbn>9781450384421</eisbn><eisbn>1450384420</eisbn><abstract>Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/3458817.3476223</doi><tpages>15</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISBN: 9781450384421
ispartof Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, p.1-15
issn 2167-4337
language eng
recordid cdi_ieee_primary_9910054
source IEEE Xplore All Conference Series
subjects Behavioral sciences
Cluster Management System
Cluster Statistical Analysis
Computing methodologies -- Distributed computing methodologies
Computing methodologies -- Machine learning -- Machine learning approaches
Computing methodologies -- Modeling and simulation -- Simulation evaluation
Deep learning
Deep Learning Training
Energy Conservation
GPU Datacenter
Graphics processing units
High performance computing
Industries
Job shop scheduling
Power demand
Time-series Prediction
Workload Scheduling
title Characterization and prediction of deep learning workloads in large-scale GPU datacenters
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T06%3A25%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Characterization%20and%20prediction%20of%20deep%20learning%20workloads%20in%20large-scale%20GPU%20datacenters&rft.btitle=Proceedings%20of%20the%20International%20Conference%20for%20High%20Performance%20Computing,%20Networking,%20Storage%20and%20Analysis&rft.au=Hu,%20Qinghao&rft.date=2021-11-14&rft.spage=1&rft.epage=15&rft.pages=1-15&rft.eissn=2167-4337&rft.isbn=9781450384421&rft.isbn_list=1450384420&rft_id=info:doi/10.1145/3458817.3476223&rft.eisbn=9781450384421&rft.eisbn_list=1450384420&rft_dat=%3Cacm_CHZPO%3Eacm_books_10_1145_3458817_3476223%3C/acm_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a177t-9aa7443a09811d48fcf90be45357657d27fbc6adfd02f2ee1105389a9d7c6b913%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9910054&rfr_iscdi=true