Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
Published in: | arXiv.org 2024-12 |
---|---|
Main Authors: | Zhao, Zhixin; Hu, Yitao; Gong, Ziqi; Yang, Guotao; Li, Wenxin; Liu, Xiulong; Li, Keqiu; Wang, Hao |
Format: | Article |
Language: | English |
Subjects: | Artificial neural networks; Budgets; Constraints; Image processing; Inference; Lower bounds; Modules; Real time; Scheduling; Splitting; Video; Workload; Workloads |
cited_by | |
---|---|
container_title | arXiv.org |
creator | Zhao, Zhixin; Hu, Yitao; Gong, Ziqi; Yang, Guotao; Li, Wenxin; Liu, Xiulong; Li, Keqiu; Wang, Hao |
description | Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving costs while meeting application latency constraints. However, existing systems suffer from excessive module latency during request dispatching, low execution throughput during module scheduling, and wasted latency budget during latency splitting for multi-DNN applications, which undermines their ability to minimize serving cost. In this paper, we design a DNN inference system called Harpagon, which minimizes the serving cost under latency constraints with a three-level design. It first maximizes the batch collection rate with a batch-aware request dispatch policy to minimize module latency. It then maximizes module throughput with multi-tuple configurations and a proper amount of dummy requests. It also carefully splits the end-to-end latency into per-module latency budgets to minimize the total serving cost for multi-DNN applications. Evaluation shows that Harpagon outperforms the state of the art by 1.49 to 2.37 times in serving cost while satisfying the latency objectives. Additionally, compared to the optimal solution using brute-force search, Harpagon derives the lower bound of serving cost for 91.5% of workloads with millisecond-level runtime. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3142731642 |
source | ProQuest - Publicly Available Content Database |
subjects | Artificial neural networks; Budgets; Constraints; Image processing; Inference; Lower bounds; Modules; Real time; Scheduling; Splitting; Video; Workload; Workloads |
title | Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T08%3A44%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Harpagon:%20Minimizing%20DNN%20Serving%20Cost%20via%20Efficient%20Dispatching,%20Scheduling%20and%20Splitting&rft.jtitle=arXiv.org&rft.au=Zhao,%20Zhixin&rft.date=2024-12-09&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3142731642%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31427316423%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3142731642&rft_id=info:pmid/&rfr_iscdi=true |
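The abstract mentions that Harpagon pads partially collected batches with a "proper amount of dummy requests" so each module can still execute at a high-throughput batch size. A minimal sketch of that idea is below; the `Batch` class, the queue representation, and the `<dummy>` sentinel are hypothetical illustrations, not structures from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Batch:
    size: int                     # profiled batch size for this module's GPU kernel
    requests: List[str] = field(default_factory=list)

def pad_with_dummies(batch: Batch, waiting: List[str]) -> List[str]:
    """Fill the batch from the waiting queue, then pad the remainder with
    dummy requests so the module always runs at its profiled batch size."""
    take = min(batch.size, len(waiting))
    filled = waiting[:take]
    dummies = ["<dummy>"] * (batch.size - take)
    return filled + dummies
```

The dummy entries waste some compute but keep the kernel at the batch size the scheduler was provisioned for, which is the trade-off the abstract alludes to.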
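The abstract also compares Harpagon against an optimal brute-force search over latency splits. A toy version of such a search, choosing per-module latency budgets that sum to at most the end-to-end SLO while minimizing total cost, could look like the following; the module names, profile numbers, and cost model are invented for illustration and do not come from the paper:

```python
from itertools import product

# Hypothetical profiles: for each module, a list of (budget_ms, cost) options.
# A larger latency budget permits larger batches, hence a lower serving cost.
PROFILES = {
    "detector":   [(20, 8.0), (40, 5.0), (60, 4.0)],
    "classifier": [(10, 6.0), (30, 3.5), (50, 3.0)],
}

def split_latency(slo_ms: float):
    """Brute-force the per-module budget split that minimizes total cost
    while the budgets sum to at most the end-to-end SLO.
    Returns (chosen options, total cost), or None if the SLO is infeasible."""
    best = None
    for combo in product(*PROFILES.values()):
        total_latency = sum(budget for budget, _ in combo)
        total_cost = sum(cost for _, cost in combo)
        if total_latency <= slo_ms and (best is None or total_cost < best[1]):
            best = (combo, total_cost)
    return best
```

The search is exponential in the number of modules, which is presumably why the paper contrasts its millisecond-level splitting algorithm against this brute-force baseline.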