
Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting


Bibliographic Details
Published in: arXiv.org 2024-12
Main Authors: Zhao, Zhixin, Hu, Yitao, Gong, Ziqi, Yang, Guotao, Li, Wenxin, Liu, Xiulong, Li, Keqiu, Wang, Hao
Format: Article
Language: English
Subjects: Artificial neural networks; Budgets; Constraints; Image processing; Inference; Lower bounds; Modules; Real time; Scheduling; Splitting; Video; Workload; Workloads
Online Access: Get full text
container_title arXiv.org
creator Zhao, Zhixin; Hu, Yitao; Gong, Ziqi; Yang, Guotao; Li, Wenxin; Liu, Xiulong; Li, Keqiu; Wang, Hao
description Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving cost while meeting application latency constraints. However, existing systems suffer from excessive module latency during request dispatching, low execution throughput during module scheduling, and a wasted latency budget when splitting end-to-end latency for multi-DNN applications, which undermines their ability to minimize serving cost. In this paper, we design a DNN inference system called Harpagon, which minimizes serving cost under latency constraints with a three-level design. It first maximizes the batch collection rate with a batch-aware request dispatch policy to minimize module latency. It then maximizes module throughput with multi-tuple configurations and a proper number of dummy requests. Finally, it carefully splits the end-to-end latency into per-module latency budgets to minimize the total serving cost of multi-DNN applications. Evaluation shows that Harpagon outperforms the state of the art by 1.49 to 2.37 times in serving cost while satisfying latency objectives. Additionally, compared to the optimal solution found by brute-force search, Harpagon achieves the lower bound of serving cost for 91.5% of workloads with millisecond-level runtime.
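To make the latency-splitting idea in the abstract concrete, below is a minimal, self-contained Python sketch of splitting an end-to-end latency SLO into per-module budgets for a multi-DNN pipeline and costing each split by the GPU replicas needed. This is not Harpagon's actual algorithm: the two-module pipeline, the linear latency profiles in PROFILE, and the helpers best_batch and split_budget are all hypothetical, and the exhaustive loop over splits corresponds to the brute-force baseline the paper compares against, not Harpagon's millisecond-level method.

```python
# Toy model (assumed, not from the paper): a module's batched execution
# latency is a fixed overhead plus a per-request term.
import math

PROFILE = {
    "detector":   {"base_ms": 8.0, "per_req_ms": 1.5},
    "classifier": {"base_ms": 4.0, "per_req_ms": 0.8},
}

def batch_latency_ms(module, batch):
    p = PROFILE[module]
    return p["base_ms"] + p["per_req_ms"] * batch

def best_batch(module, budget_ms, max_batch=32):
    """Largest batch whose execution fits the module's budget share.
    Half the budget is (crudely) reserved for batch collection, since a
    request may wait for its batch to fill. Returns None if infeasible."""
    best = None
    for b in range(1, max_batch + 1):
        if batch_latency_ms(module, b) <= budget_ms / 2:
            best = b
    return best

def replicas_needed(module, batch, rate_rps):
    """GPU replicas so that per-replica throughput covers the rate."""
    thpt = batch / (batch_latency_ms(module, batch) / 1000.0)
    return math.ceil(rate_rps / thpt)

def split_budget(slo_ms, rate_rps, step_ms=5.0):
    """Brute-force search over per-module budget splits (the baseline
    the paper compares against); returns the cheapest feasible split."""
    best = None
    for i in range(1, int(slo_ms / step_ms)):
        d_ms = step_ms * i
        budgets = {"detector": d_ms, "classifier": slo_ms - d_ms}
        batches = {m: best_batch(m, bud) for m, bud in budgets.items()}
        if any(b is None for b in batches.values()):
            continue  # even batch size 1 misses this split's budget
        cost = sum(replicas_needed(m, b, rate_rps) for m, b in batches.items())
        if best is None or cost < best[0]:
            best = (cost, budgets, batches)
    return best

cost, budgets, batches = split_budget(slo_ms=100.0, rate_rps=500.0)
print(f"total GPUs: {cost}, budgets: {budgets}, batch sizes: {batches}")
```

In this toy model the fixed 50/50 split between batch collection and execution is exactly the kind of wasted budget the abstract attributes to existing systems; Harpagon's batch-aware dispatch and dummy-request padding target that waste directly.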
format article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3142731642
source ProQuest - Publicly Available Content Database
subjects Artificial neural networks
Budgets
Constraints
Image processing
Inference
Lower bounds
Modules
Real time
Scheduling
Splitting
Video
Workload
Workloads
title Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting