
A Communication-Aware DNN Accelerator on ImageNet Using In-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65-nm CMOS

Bibliographic Details
Published in: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2020-09, Vol. 10 (3), p. 283-294
Main Authors: Zhu, Haozhe; Chen, Chixiao; Liu, Shiwei; Zou, Qiaosha; Wang, Mingyu; Zhang, Lihua; Zeng, Xiaoyang; Shi, C.-J. Richard
Format: Article
Language: English
Description
Summary: This article presents a communication-aware processing-in-memory deep neural network accelerator that implements an in-memory entry-counting scheme for low bit-width quantized multiply-and-accumulate (MAC) operations. To maintain good accuracy on ImageNet, the proposed design adopts a full-stack co-design methodology spanning algorithms, circuits, and architectures. At the algorithm level, an entry-counting based MAC is proposed to fit the learned step-size quantization scheme and to intrinsically exploit the sparsity of both activations and weights. At the circuit level, content-addressable memory cells and multiplexed arrays are developed in the processing-in-memory macro. At the architecture level, the proposed design is compatible with different stationary dataflow mappings, further reducing memory access. An in-memory entry-counting silicon prototype and its entire peripheral circuits are fabricated in 65-nm LP CMOS technology with an active area of 0.76 × 0.66 mm². The 7.36-Kb processing-in-memory macro with 128 search entries can reduce the number of multiplications by 12.8×. The peak throughput is 3.58 GOPS, achieved at a clock rate of 143 MHz and a supply voltage of 1.23 V. The peak energy efficiency of the processing-in-memory macro is 11.6 TOPS/W, achieved at a clock rate of 40 MHz and a supply voltage of 1.01 V. The physical design of the entry-counting memory is completed in a standard digital placement-and-routing flow by augmenting the library with two dedicated memory cells. A 3-bit quantized ResNet-18 evaluated on the ImageNet dataset achieves a top-1 accuracy of 64.4%.
ISSN: 2156-3357, 2156-3365
DOI: 10.1109/JETCAS.2020.3014920
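
The entry-counting MAC mentioned in the summary can be illustrated in software. The sketch below is only an assumption about the general principle: with low bit-width codes, many activation/weight pairs repeat, so one can count how often each distinct pair occurs and multiply each distinct pair once, skipping pairs that contain a zero. It is not the authors' circuit or their exact algorithm, and the function names and sample data are hypothetical.

```python
from collections import Counter


def direct_mac(activations, weights):
    """Reference MAC: one multiplication per activation/weight pair."""
    return sum(a * w for a, w in zip(activations, weights))


def entry_counting_mac(activations, weights):
    """Illustrative entry-counting MAC (hypothetical helper).

    Counts occurrences of each distinct (activation, weight) code pair,
    skipping pairs with a zero operand, then multiplies each distinct
    pair once and scales the product by its count.
    Returns (dot product, number of distinct non-zero entries).
    """
    counts = Counter(
        (a, w) for a, w in zip(activations, weights) if a != 0 and w != 0
    )
    total = sum(n * a * w for (a, w), n in counts.items())
    return total, len(counts)


# Hypothetical low bit-width codes, chosen only for illustration.
activations = [0, 1, 3, 3, 0, 2, 1, 3, 2, 0, 1, 3]
weights = [2, -1, 2, 2, 1, 0, -1, 2, -3, 3, -1, 2]

result, distinct_entries = entry_counting_mac(activations, weights)
print(result, distinct_entries, len(activations))
```

In this toy input, twelve pairs collapse to three distinct non-zero entries, so only three products need to be formed while the result matches the direct MAC. The reduction on real layers depends on the quantization bit width and sparsity; the 12.8× figure in the summary is the authors' reported measurement, not something this sketch reproduces.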