
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

The explosive arrival of OpenAI's ChatGPT has fueled the global spread of large language models (LLMs), which consist of billions of pretrained parameters embodying aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data-synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
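As a quick sanity check of the figures quoted above, the short Python sketch below converts the per-token latencies into single-stream throughput and backs out the implied GPU baseline latency, assuming the reported speedups are simple per-token latency ratios (an assumption on our part; the paper may define them differently).

# Sanity check of the abstract's latency figures. Assumes the quoted
# speedups are straight per-token latency ratios (GPU ms / LPU ms),
# which is our assumption, not a claim taken from the paper.
lpu_latency_ms = {"1.3B": 1.25, "66B": 20.9}   # ms/token on LPU (from abstract)
speedup_vs_gpu = {"1.3B": 2.09, "66B": 1.37}   # reported speedup over the GPU

for model, lpu_ms in lpu_latency_ms.items():
    tokens_per_sec = 1000.0 / lpu_ms             # single-stream throughput
    gpu_ms = lpu_ms * speedup_vs_gpu[model]      # implied GPU baseline latency
    print(f"{model}: LPU {lpu_ms} ms/token ({tokens_per_sec:.0f} tok/s); "
          f"implied GPU baseline ~{gpu_ms:.2f} ms/token")

For the 1.3B model this gives 800 tok/s on the LPU and an implied GPU baseline of about 2.61 ms/token; for the 66B model, about 48 tok/s and roughly 28.6 ms/token.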

Bibliographic Details
Published in: arXiv.org, 2024-08
Main Authors: Moon, Seungjae; Kim, Jung-Hoon; Kim, Junsoo; Hong, Seongmin; Cha, Junseo; Kim, Minsu; Lim, Sukbin; Choi, Gyubin; Seo, Dongjin; Kim, Jongho; Lee, Hunjong; Park, Hyunjun; Ko, Ryeowook; Choi, Soongyu; Park, Jongse; Lee, Jinwon; Kim, Joo-Young
Format: Article
Language: English
Identifier: EISSN 2331-8422
Subjects: Globalization; Inference; Large language models; Microprocessors; Network latency; Semantics; Synchronism
Online Access: Get full text