
PipeCIM: A High-Throughput Computing-In-Memory Microprocessor With Nested Pipeline and RISC-V Extended Instructions

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers, 2024-07, Vol. 71 (7), p. 3214-3227
Main Authors: Chen, Tingran, Wang, Wenjia, Chen, Jiaqi, Fu, Haotian, Yi, Wente, Cheng, Bojun, Zhang, He, Pan, Biao
Format: Article
Language:English
Description
Summary: The large number of multiply-accumulate (MAC) operations in convolutional neural networks (CNNs) leads to substantial data migration and computation. Although computing-in-memory (CIM) is a promising paradigm for MAC operations, high-throughput CNN accelerators still confront two bottlenecks: low MAC utilization and unnecessary off-chip memory access. In this paper, we propose PipeCIM, a high-throughput CIM-based CNN accelerator with three hierarchies of pipelines: Intra-Macro, Near-Memory, and Tile-Level. The Intra-Macro Pipeline executes data transfer and in-memory-computing (IMC) operations in parallel. The Near-Memory Pipeline alleviates memory access for pooling and data reshaping. The Tile-Level Pipeline establishes a layer-wise pipeline that further improves throughput while reducing control complexity. PipeCIM introduces the nested scheme and a Unidirectional Divergent Connection Protocol (UDTCP) to simplify data-flow control with the help of customized RISC-V instructions. To validate the design, PipeCIM was prototyped in a 55 nm process node, achieving an energy efficiency of 133.8 TOPS/W and a peak throughput of 819 GOPS with a 16 KB CIM array, accelerating VGG-16 by 128.56× and Inception by 19.754× compared to the baseline.
ISSN: 1549-8328, 1558-0806
DOI: 10.1109/TCSI.2024.3384271