A 22 nm Floating-Point ReRAM Compute-in-Memory Macro Using Residue-Shared ADC for AI Edge Device
Published in: | IEEE Journal of Solid-State Circuits, 2024-10, p. 1-13 |
---|---|
Format: | Article |
Language: | English |
Summary: | Artificial intelligence (AI) edge devices increasingly require the enhanced accuracy of floating-point (FP) multiply-and-accumulate (MAC) operations as well as non-volatile on-chip memory to minimize the movement of weight data in power-off mode. Designing non-volatile compute-in-memory (nvCIM) macros for FP operations poses several challenges: 1) a tradeoff between inference accuracy and weight bit-width following pre-alignment; 2) long computing latency and high energy consumption; 3) large cell array current during computation; and 4) high multi-bit readout energy consumption. In this study, we devised four schemes to address these issues: 1) kernel-wise weight pre-alignment (K-WPA); 2) rescheduled multi-bit input compression (RS-MIC); 3) an HRS-favored dual-sign-bit (HF-DSB) scheme; and 4) a residue-shared analog-to-digital converter (RS-ADC). A 16 Mb resistive random-access memory (ReRAM) nvCIM macro for FP operations, fabricated using foundry-provided ReRAM in 22 nm CMOS technology, achieved an energy efficiency of 34.2 TFLOPS/W with BF16 inputs, BF16 weights, and FP32 outputs, and 31.4 TFLOPS/W with FP16 inputs, FP16 weights, and FP32 outputs. |
---|---|
ISSN: | 0018-9200 1558-173X |
DOI: | 10.1109/JSSC.2024.3470211 |