A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flagship Mobile SoC

Bibliographic Details
Published in:IEEE Journal of Solid-State Circuits, 2023-01, Vol. 58 (1), pp. 189-202
Main Authors: Park, Jun-Seok, Park, Changsoo, Kwon, Suknam, Jeon, Taeho, Kang, Yesung, Lee, Heonsoo, Lee, Dongwoo, Kim, James, Kim, Hyeong-Seok, Lee, YoungJong, Park, Sangkyu, Kim, MinSeong, Ha, SangHyuck, Bang, Jihoon, Park, Jinpyo, Lim, SukHwan, Kang, Inyup
Format: Article
Language:English
Description
Summary:This article presents an 8k-multiply-accumulate (MAC) neural processing unit (NPU) in a 4-nm mobile system-on-chip (SoC). The unified multi-precision MACs support data types from integer (INT)4/8/16 to floating point (FP)16 with high area and energy efficiency. When the NPU encounters layers with low hardware (HW) utilization, such as depthwise convolutions or shallow layers with few input channels, it reconfigures the computational flow to improve utilization by up to four times, using basic tensor information supplied by the compiler, such as operation types and shapes. The NPU also supports a dynamic operation mode that covers requirements ranging from extremely low power to low latency. The NPU achieves 4.26 tera FP operations per second (TFLOPS)/W and 11.59 tera operations per second (TOPS)/W on DeepLabV3 (FP16) and MobileNetEdgeTPU (INT8), respectively, as well as high area efficiency (1.72 TFLOPS/mm² and 3.45 TOPS/mm²).
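
As an illustration of the utilization problem the abstract describes, here is a minimal sketch (in Python; not from the paper) of how folding work onto idle input-channel lanes can recover throughput on a shallow layer. The array geometry, lane counts, and remapping scheme below are illustrative assumptions, not the published datapath.

    # A minimal utilization model for a fixed-geometry MAC array. The geometry
    # (32 input-channel lanes x 16 output-channel lanes x 16 spatial lanes
    # = 8192 MACs) and the remapping are assumptions for illustration only.

    IC_LANES, OC_LANES, SP_LANES = 32, 16, 16  # assumed 8k-MAC array shape

    def utilization(c_in: int, c_out: int, spatial: int) -> float:
        """Fraction of MACs doing useful work for one naively tiled layer."""
        u_ic = min(c_in, IC_LANES) / IC_LANES
        u_oc = min(c_out, OC_LANES) / OC_LANES
        u_sp = min(spatial, SP_LANES) / SP_LANES
        return u_ic * u_oc * u_sp

    # Shallow first layer (e.g., RGB input, c_in = 3): most input-channel
    # lanes sit idle under the default dataflow.
    naive = utilization(c_in=3, c_out=32, spatial=64)

    # Utilization-aware remapping: process 4 spatial positions in parallel by
    # folding them onto the idle input-channel lanes (effective c_in = 12),
    # matching the "up to four times" improvement reported in the abstract.
    remapped = utilization(c_in=3 * 4, c_out=32, spatial=64 // 4)

    print(f"naive: {naive:.1%}, remapped: {remapped:.1%}, "
          f"gain: {remapped / naive:.1f}x")
    # -> naive: 9.4%, remapped: 37.5%, gain: 4.0x

The same idea applies to depthwise convolution, where each output channel consumes only one input channel, so a dataflow sized for many input channels leaves most MAC lanes idle unless the computation is remapped.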
ISSN:0018-9200
1558-173X
DOI:10.1109/JSSC.2022.3205713