Characterization and prediction of deep learning workloads in large-scale GPU datacenters
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors.
Main Authors: Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei
Format: Conference Proceeding
Language: English
Online Access: Request full text
container_start_page | 1 |
container_end_page | 15 |
creator | Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei |
description | Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5×; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%. |
doi_str_mv | 10.1145/3458817.3476223 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISBN: 9781450384421 |
ispartof | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, p.1-15 |
issn | 2167-4337 |
language | eng |
recordid | cdi_ieee_primary_9910054 |
source | IEEE Xplore All Conference Series |
subjects | Behavioral sciences; Cluster Management System; Cluster Statistical Analysis; Computing methodologies -- Distributed computing methodologies; Computing methodologies -- Machine learning -- Machine learning approaches; Computing methodologies -- Modeling and simulation -- Simulation evaluation; Deep learning; Deep Learning Training; Energy Conservation; GPU Datacenter; Graphics processing units; High performance computing; Industries; Job shop scheduling; Power demand; Time-series Prediction; Workload Scheduling |
title | Characterization and prediction of deep learning workloads in large-scale GPU datacenters |
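The abstract describes a Quasi-Shortest-Service-First (QSSF) scheduling service that uses historical traces to predict job service time and prioritize short jobs, reducing cluster-wide average job completion time. The snippet below is a minimal sketch of that general idea, not the paper's implementation; the `Job` class, the `qssf_order` helper, and the choice of `num_gpus * predicted_duration_s` as the service-time estimate are illustrative assumptions.

```python
# Hypothetical sketch of a Quasi-Shortest-Service-First (QSSF) ordering.
# Names and the service-time estimate are illustrative, not from the paper.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Jobs compare by this single scalar: predicted GPU service time.
    priority: float = field(init=False)
    name: str = field(compare=False)
    num_gpus: int = field(compare=False)
    predicted_duration_s: float = field(compare=False)

    def __post_init__(self) -> None:
        # Assumed estimate: GPUs requested times predicted runtime,
        # e.g. produced by a history-based duration predictor.
        self.priority = self.num_gpus * self.predicted_duration_s


def qssf_order(pending: list[Job]) -> list[Job]:
    """Return pending jobs in the order a QSSF queue would dequeue them."""
    heap = list(pending)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    jobs = [
        Job("bert-pretrain", num_gpus=32, predicted_duration_s=86_400),
        Job("hpo-trial", num_gpus=1, predicted_duration_s=600),
        Job("resnet50-train", num_gpus=8, predicted_duration_s=3_600),
    ]
    for job in qssf_order(jobs):
        print(f"{job.name}: {job.priority:,.0f} predicted GPU-seconds")
```

Ranking pending jobs by a single predicted-service-time scalar keeps each scheduling decision cheap (a heap pop), which is one reason shortest-service-first style policies are attractive when historical traces make job durations predictable.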