A 22 nm Floating-Point ReRAM Compute-in-Memory Macro Using Residue-Shared ADC for AI Edge Device

Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, 2024-10, p. 1-13
Main Authors: Hsu, Hung-Hsi; Wen, Tai-Hao; Khwa, Win-San; Huang, Wei-Hsing; Ke, Zhao-En; Chin, Yu-Hsiang; Wen, Hua-Jin; Chang, Yu-Chen; Hsu, Wei-Ting; Lele, Ashwin Sanjay; Zhang, Bo; Wu, Ping-Sheng; Lo, Chung-Chuan; Liu, Ren-Shuo; Hsieh, Chih-Cheng; Tang, Kea-Tiong; Teng, Shih-Hsin; Chou, Chung-Cheng; Chih, Yu-Der; Chang, Tsung-Yung Jonathan; Chang, Meng-Fan
Format: Article
Language: English
Description
Summary: Artificial intelligence (AI) edge devices increasingly require the enhanced accuracy of floating-point (FP) multiply-and-accumulate (MAC) operations, as well as non-volatile on-chip memory to minimize the movement of weight data in power-off mode. Designing non-volatile compute-in-memory (nvCIM) macros for FP operations poses several challenges: 1) a tradeoff between inference accuracy and weight bit-width following pre-alignment; 2) long computing latency and high energy consumption; 3) large cell array current during computation; and 4) high multi-bit readout energy consumption. In this study, we devised four schemes to address these issues: 1) kernel-wise weight pre-alignment (K-WPA); 2) rescheduled multi-bit input compression (RS-MIC); 3) HRS-favored dual-sign-bit (HF-DSB); and 4) a residue-shared analog-to-digital converter (RS-ADC). A 16 Mb resistive random access memory (ReRAM) nvCIM macro for FP operations, fabricated using foundry-provided ReRAM in 22 nm CMOS technology, achieved an energy efficiency of 34.2 TFLOPS/W under BF16-input, BF16-weight, and FP32-output and 31.4 TFLOPS/W under FP16-input, FP16-weight, and FP32-output.
ISSN: 0018-9200, 1558-173X
DOI: 10.1109/JSSC.2024.3470211
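
Note on the first challenge in the summary: the tradeoff between accuracy and weight bit-width after pre-alignment can be illustrated with a generic sketch. This is not the paper's K-WPA scheme, whose details are not given in this record. It only assumes the common approach of aligning a group of FP weights to the group's largest exponent and keeping a fixed number of mantissa bits. The function names `prealign` and `reconstruct` and the parameter `mantissa_bits` are hypothetical.

```python
import math

def prealign(weights, mantissa_bits):
    """Align a group of FP weights to the group's largest exponent.

    Returns (shared_exponent, integer_mantissas). Illustrative assumption,
    not the scheme used in the paper.
    """
    nonzero = [w for w in weights if w != 0.0]
    if not nonzero:
        return 0, [0] * len(weights)
    # math.frexp(w) returns (m, e) with w == m * 2**e and 0.5 <= |m| < 1
    shared_exp = max(math.frexp(w)[1] for w in nonzero)
    scale = 2.0 ** (mantissa_bits - shared_exp)
    # int() truncates toward zero, so small weights lose low-order bits
    mantissas = [int(w * scale) for w in weights]
    return shared_exp, mantissas

def reconstruct(shared_exp, mantissas, mantissa_bits):
    """Map aligned integer mantissas back to floating-point values."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]

weights = [1.5, 0.375, -0.01]
for bits in (8, 4):
    exp, mant = prealign(weights, bits)
    print(bits, exp, mant, reconstruct(exp, mant, bits))
```

With 8 mantissa bits the smallest weight survives as a coarse approximation; with 4 bits it is truncated to zero. Narrower aligned weights cost less to store and compute on, but weights far below the group maximum lose precision, which is the tradeoff the summary refers to.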