A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flagship Mobile SoC
Published in: IEEE Journal of Solid-State Circuits, 2023-01, Vol. 58 (1), pp. 189-202
Main Authors: Park, Jun-Seok; Park, Changsoo; Kwon, Suknam; Jeon, Taeho; Kang, Yesung; Lee, Heonsoo; Lee, Dongwoo; Kim, James; Kim, Hyeong-Seok; Lee, YoungJong; Park, Sangkyu; Kim, MinSeong; Ha, SangHyuck; Bang, Jihoon; Park, Jinpyo; Lim, SukHwan; Kang, Inyup
Format: Article
Language: English
Subjects: Artificial neural networks; Computational modeling; Compute utilization; Computer architecture; Deep neural networks (DNNs); Domain-specific architecture (DSA); Floating-point arithmetic; Frequency modulation; Hardware; Inference accelerator; Parallel processing; Sparsity-aware zero skipping; System on chip; Tensors; Unified multiply-accumulate (MAC); Utilization
DOI: 10.1109/JSSC.2022.3205713
ISSN: 0018-9200
EISSN: 1558-173X
Source: IEEE Xplore (Online service)
Abstract: This article presents an 8k-multiply-accumulate (MAC) neural processing unit (NPU) in a 4-nm mobile system-on-chip (SoC). The unified multi-precision MACs support integer (INT) 4/8/16 through floating-point (FP) 16 data with high area and energy efficiency. When the NPU encounters layers with low hardware (HW) utilization, such as depthwise convolutions or shallow layers with few input channels, it reconfigures the computational flow to improve utilization by up to four times, using basic tensor information provided by the compiler, such as operation types and shapes. The NPU also supports a dynamic operation mode that covers requirements ranging from extremely low power to low latency. It achieves 4.26 tera FP operations per second per watt (TFLOPS/W) and 11.59 tera operations per second per watt (TOPS/W) on DeepLabV3 (FP16) and MobileNetEdgeTPU (INT8), respectively, along with high area efficiency (1.72 TFLOPS/mm² and 3.45 TOPS/mm²).
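To see why depthwise convolution starves a large MAC array and why a reconfigurable dataflow helps, the sketch below gives a minimal back-of-the-envelope utilization model. It is not the paper's actual dataflow: the 64 × 128 lane layout for the 8,192 MACs and the `utilization` helper are hypothetical choices made only for this illustration; the abstract's reported gain of up to 4× is used as the remapping factor.

```python
# Minimal illustrative model (assumed geometry, not the real NPU dataflow):
# an 8k-MAC array organized as 64 input-channel lanes x 128 output-channel
# lanes. Standard convolutions keep every lane busy; depthwise convolutions
# touch only one input channel per output channel, idling most of the array.

IC_LANES = 64    # lanes parallelizing over input channels (assumption)
OC_LANES = 128   # lanes parallelizing over output channels (assumption)

def utilization(active_ic: int, active_oc: int) -> float:
    """Fraction of the 8,192-MAC array doing useful work per cycle."""
    busy = min(active_ic, IC_LANES) * min(active_oc, OC_LANES)
    return busy / (IC_LANES * OC_LANES)

# Standard conv, 64 in / 128 out channels: every lane has work.
print(utilization(64, 128))      # 1.0

# Depthwise conv: each output channel reads exactly one input channel.
print(utilization(1, 128))       # 0.015625 (~1.6%)

# Shallow first layer (3 RGB input channels): almost as starved.
print(utilization(3, 128))       # ~0.023

# Utilization-aware remap: feeding extra work (e.g., multiple spatial
# positions of the same channel) into otherwise-idle input-channel lanes
# can raise the busy fraction by the abstract's reported up-to-4x factor.
print(4 * utilization(1, 128))   # 0.0625 (4x better)
```

Under this toy model, the compiler-supplied tensor shapes are exactly the information needed to decide when such a remap pays off, which matches the abstract's description of the NPU receiving operation types and shapes before reconfiguring.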