
Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Bibliographic Details
Published in: ACM Transactions on Embedded Computing Systems, 2019-10, Vol. 18 (5s), p. 1-23
Main Authors: Jiang, Weiwen; Sha, Edwin H.-M.; Zhang, Xinyi; Yang, Lei; Zhuge, Qingfeng; Shi, Yiyu; Hu, Jingtong
Format: Article
Language: English
Description: Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple’s Siri) and edge computing (e.g., Google/Waymo’s driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with small batch sizes, FPGAs are expected to achieve further performance improvements. However, the performance gain of a single-FPGA design is limited by its on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over the single-FPGA design. In implementing such systems, we found two barriers that hinder this goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, “Super-LIP”, which can support different kinds of DNNs. We take Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology that effectively alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48× speedup compared to the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
DOI: 10.1145/3358192
ISSN: 1539-9087
EISSN: 1558-3465
Source: Association for Computing Machinery: Jisc Collections: ACM OPEN Journals 2023-2025
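
To make the super-linear-speedup argument in the abstract concrete, the sketch below is a minimal, self-contained latency model written in its spirit. It is not the paper's actual Super-LIP formulation: the roofline-style cost formulas, the bandwidth figures, and the layer sizes are all illustrative assumptions. It shows how, when a single-FPGA design is bound by DRAM bandwidth, splitting a layer's weights across boards and moving feature-map traffic onto dedicated inter-FPGA links can shrink per-board DRAM traffic by more than the number of FPGAs, so two boards can beat a 2x speedup.

# Illustrative roofline-style latency model for one CNN layer split across FPGAs.
# NOT the paper's Super-LIP model: the cost formulas and all parameter values
# below are assumptions chosen only to illustrate how super-linear speedup can
# arise when the single-FPGA design is DRAM-bandwidth bound.

def layer_latency(macs, weight_bytes, fmap_bytes, n_fpgas,
                  peak_macs_per_s=1.0e12,  # assumed per-FPGA compute roof (MAC/s)
                  dram_bw=2.0e9,           # assumed per-FPGA DRAM bandwidth (bytes/s)
                  link_bw=10.0e9):         # assumed inter-FPGA link bandwidth (bytes/s)
    """Latency (seconds) of one convolutional layer partitioned over n_fpgas."""
    compute_t = (macs / n_fpgas) / peak_macs_per_s  # compute work split evenly
    if n_fpgas == 1:
        # Single FPGA: weights AND intermediate feature maps share one DRAM bus.
        dram_t = (weight_bytes + fmap_bytes) / dram_bw
        link_t = 0.0
    else:
        # Multi-FPGA: each board streams only its slice of the weights from DRAM,
        # while feature maps / partial sums travel over FPGA-to-FPGA links,
        # off-loading the memory bus (the traffic-shifting idea in the abstract).
        dram_t = (weight_bytes / n_fpgas) / dram_bw
        link_t = fmap_bytes / link_bw
    # Assume compute and data movement overlap (double buffering), so the layer
    # is bound by whichever of the three terms is slowest.
    return max(compute_t, dram_t, link_t)

# Hypothetical layer: 2 GMACs, 20 MB of weights, 2 MB of feature maps.
t1 = layer_latency(2e9, 20e6, 2e6, n_fpgas=1)
t2 = layer_latency(2e9, 20e6, 2e6, n_fpgas=2)
print(f"1 FPGA: {t1*1e3:.1f} ms, 2 FPGAs: {t2*1e3:.1f} ms, speedup = {t1/t2:.2f}x")

With these made-up numbers the single-FPGA run is DRAM-bound (22 MB / 2 GB/s = 11 ms), while each of two FPGAs streams only 10 MB from DRAM (5 ms), giving a 2.2x speedup. The published 3.48x figure for two FPGAs comes from the paper's real design and model, not from this toy sketch.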