
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

The explosive arrival of OpenAI's ChatGPT has fueled the global spread of large language models (LLMs), which consist of billions of pretrained parameters embodying aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data-synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
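As a quick sanity check of the figures quoted above, the short Python sketch below converts the per-token latencies into single-stream throughput and backs out the implied GPU baseline latency, assuming the reported speedups are simple per-token latency ratios (an assumption on our part; the paper may define them differently).

# Sanity check of the abstract's latency figures. Assumes the quoted
# speedups are straight per-token latency ratios (GPU ms / LPU ms),
# which is our assumption, not a claim taken from the paper.
lpu_latency_ms = {"1.3B": 1.25, "66B": 20.9}   # ms/token on LPU (from abstract)
speedup_vs_gpu = {"1.3B": 2.09, "66B": 1.37}   # reported speedup over the GPU

for model, lpu_ms in lpu_latency_ms.items():
    tokens_per_sec = 1000.0 / lpu_ms             # single-stream throughput
    gpu_ms = lpu_ms * speedup_vs_gpu[model]      # implied GPU baseline latency
    print(f"{model}: LPU {lpu_ms} ms/token ({tokens_per_sec:.0f} tok/s); "
          f"implied GPU baseline ~{gpu_ms:.2f} ms/token")

For the 1.3B model this gives 800 tok/s on the LPU and an implied GPU baseline of about 2.61 ms/token; for the 66B model, about 48 tok/s and roughly 28.6 ms/token.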

Bibliographic Details
Published in: arXiv.org, 2024-08
Main Authors: Moon, Seungjae; Kim, Jung-Hoon; Kim, Junsoo; Hong, Seongmin; Cha, Junseo; Kim, Minsu; Lim, Sukbin; Choi, Gyubin; Seo, Dongjin; Kim, Jongho; Lee, Hunjong; Park, Hyunjun; Ko, Ryeowook; Choi, Soongyu; Park, Jongse; Lee, Jinwon; Kim, Joo-Young
Format: Article
Language: English
Identifier: EISSN 2331-8422
Subjects: Globalization; Inference; Large language models; Microprocessors; Network latency; Semantics; Synchronism
Online Access: Get full text