
Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting


Bibliographic Details
Published in: arXiv.org 2024-12
Main Authors: Zhao, Zhixin, Hu, Yitao, Gong, Ziqi, Yang, Guotao, Li, Wenxin, Liu, Xiulong, Li, Keqiu, Wang, Hao
Format: Article
Language: English
Subjects: Artificial neural networks; Budgets; Constraints; Image processing; Inference; Lower bounds; Modules; Real time; Scheduling; Splitting; Video; Workload; Workloads
Online Access: Get full text
container_title arXiv.org
creator Zhao, Zhixin; Hu, Yitao; Gong, Ziqi; Yang, Guotao; Li, Wenxin; Liu, Xiulong; Li, Keqiu; Wang, Hao
description Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving cost while meeting application latency constraints. However, existing systems suffer from excessive module latency during request dispatching, low execution throughput during module scheduling, and a wasted latency budget when splitting end-to-end latency for multi-DNN applications, which undermines their ability to minimize serving cost. In this paper, we design a DNN inference system called Harpagon, which minimizes serving cost under latency constraints with a three-level design. It first maximizes the batch collection rate with a batch-aware request dispatch policy to minimize module latency. It then maximizes module throughput with multi-tuple configurations and a proper number of dummy requests. Finally, it carefully splits the end-to-end latency into per-module latency budgets to minimize the total serving cost of multi-DNN applications. Evaluation shows that Harpagon outperforms the state of the art by 1.49 to 2.37 times in serving cost while satisfying latency objectives. Additionally, compared to the optimal solution found by brute-force search, Harpagon achieves the lower bound of serving cost for 91.5% of workloads with millisecond-level runtime.
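To make the latency-splitting idea in the abstract concrete, below is a minimal, self-contained Python sketch of splitting an end-to-end latency SLO into per-module budgets for a multi-DNN pipeline and costing each split by the GPU replicas needed. This is not Harpagon's actual algorithm: the two-module pipeline, the linear latency profiles in PROFILE, and the helpers best_batch and split_budget are all hypothetical, and the exhaustive loop over splits corresponds to the brute-force baseline the paper compares against, not Harpagon's millisecond-level method.

```python
# Toy model (assumed, not from the paper): a module's batched execution
# latency is a fixed overhead plus a per-request term.
import math

PROFILE = {
    "detector":   {"base_ms": 8.0, "per_req_ms": 1.5},
    "classifier": {"base_ms": 4.0, "per_req_ms": 0.8},
}

def batch_latency_ms(module, batch):
    p = PROFILE[module]
    return p["base_ms"] + p["per_req_ms"] * batch

def best_batch(module, budget_ms, max_batch=32):
    """Largest batch whose execution fits the module's budget share.
    Half the budget is (crudely) reserved for batch collection, since a
    request may wait for its batch to fill. Returns None if infeasible."""
    best = None
    for b in range(1, max_batch + 1):
        if batch_latency_ms(module, b) <= budget_ms / 2:
            best = b
    return best

def replicas_needed(module, batch, rate_rps):
    """GPU replicas so that per-replica throughput covers the rate."""
    thpt = batch / (batch_latency_ms(module, batch) / 1000.0)
    return math.ceil(rate_rps / thpt)

def split_budget(slo_ms, rate_rps, step_ms=5.0):
    """Brute-force search over per-module budget splits (the baseline
    the paper compares against); returns the cheapest feasible split."""
    best = None
    for i in range(1, int(slo_ms / step_ms)):
        d_ms = step_ms * i
        budgets = {"detector": d_ms, "classifier": slo_ms - d_ms}
        batches = {m: best_batch(m, bud) for m, bud in budgets.items()}
        if any(b is None for b in batches.values()):
            continue  # even batch size 1 misses this split's budget
        cost = sum(replicas_needed(m, b, rate_rps) for m, b in batches.items())
        if best is None or cost < best[0]:
            best = (cost, budgets, batches)
    return best

cost, budgets, batches = split_budget(slo_ms=100.0, rate_rps=500.0)
print(f"total GPUs: {cost}, budgets: {budgets}, batch sizes: {batches}")
```

In this toy model the fixed 50/50 split between batch collection and execution is exactly the kind of wasted budget the abstract attributes to existing systems; Harpagon's batch-aware dispatch and dummy-request padding target that waste directly.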
format article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3142731642
source ProQuest - Publicly Available Content Database
subjects Artificial neural networks
Budgets
Constraints
Image processing
Inference
Lower bounds
Modules
Real time
Scheduling
Splitting
Video
Workload
Workloads
title Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting