LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
The explosive arrival of OpenAI's ChatGPT has fueled the widespread adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data-synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token on the 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x better energy efficiency than NVIDIA H100 and L4 servers, respectively.
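The abstract's headline latency numbers can be sanity-checked with simple arithmetic. Below is a minimal sketch, assuming the GPU baselines are simply the reported LPU latencies scaled by the reported speedup factors (the record states only the LPU latencies and the relative speedups, not the GPU figures themselves):

```python
# Worked check of the reported per-token latencies.
# Assumption (not stated in the record): GPU latency = LPU latency x speedup.
models = {
    "1.3B": {"lpu_ms_per_token": 1.25, "speedup_vs_gpu": 2.09},
    "66B":  {"lpu_ms_per_token": 20.9, "speedup_vs_gpu": 1.37},
}

for name, m in models.items():
    lpu_ms = m["lpu_ms_per_token"]
    tok_per_s = 1000.0 / lpu_ms              # invert latency to get throughput
    gpu_ms = lpu_ms * m["speedup_vs_gpu"]    # derived, not a reported number
    print(f"{name}: LPU {lpu_ms} ms/token = {tok_per_s:.0f} tok/s; "
          f"implied GPU baseline ~ {gpu_ms:.2f} ms/token")
```

For the 1.3B model this works out to 800 tokens/s on the LPU and an implied GPU baseline of about 2.61 ms/token.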
Published in: | arXiv.org 2024-08 |
---|---|
Main Authors: | Moon, Seungjae; Jung-Hoon, Kim; Kim, Junsoo; Hong, Seongmin; Cha, Junseo; Kim, Minsu; Lim, Sukbin; Choi, Gyubin; Seo, Dongjin; Kim, Jongho; Lee, Hunjong; Park, Hyunjun; Ko, Ryeowook; Choi, Soongyu; Park, Jongse; Lee, Jinwon; Joo-Young, Kim |
Format: | Article |
Language: | English |
Subjects: | Globalization; Inference; Large language models; Microprocessors; Network latency; Semantics; Synchronism |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | Moon, Seungjae; Jung-Hoon, Kim; Kim, Junsoo; Hong, Seongmin; Cha, Junseo; Kim, Minsu; Lim, Sukbin; Choi, Gyubin; Seo, Dongjin; Kim, Jongho; Lee, Hunjong; Park, Hyunjun; Ko, Ryeowook; Choi, Soongyu; Park, Jongse; Lee, Jinwon; Joo-Young, Kim |
description | The explosive arrival of OpenAI's ChatGPT has fueled the widespread adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data-synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token on the 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x better energy efficiency than NVIDIA H100 and L4 servers, respectively. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-08 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3093280768 |
source | Publicly Available Content (ProQuest) |
subjects | Globalization; Inference; Large language models; Microprocessors; Network latency; Semantics; Synchronism |
title | LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference |