Loading…

Analog Computing in Memory (CIM) Technique for General Matrix Multiplication (GEMM) to Support Deep Neural Network (DNN) and Cosine Similarity Search Computing using 3D AND-type NOR Flash Devices

The massively increasing data in computing world inspires the R&D of novel memory-centric computing architectures and devices. In this work, we propose a novel analog CIM technique for GEMM using 3D NOR Flash devices to support general-purpose matrix multiplication. Our analysis indicates that i...

Full description

Saved in:
Bibliographic Details
Main Authors: Wei, Ming-Liang, Lue, Hang-Ting, Ho, Shu-Yin, Lin, Yen-Po, Hsu, Tzu-Hsuan, Hsieh, Chih-Chang, Li, Yung-Chun, Yeh, Teng-Hao, Chen, Shih-Hung, Jhu, Yi-Hao, Li, Hsiang-Pang, Hu, Han-Wen, Hung, Chun-Hsiung, Wang, Keh-Chung, Lu, Chih-Yuan
Format: Conference Proceeding
Language:English
Subjects:
Citations: Items that cite this one
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The massively increasing data in computing world inspires the R&D of novel memory-centric computing architectures and devices. In this work, we propose a novel analog CIM technique for GEMM using 3D NOR Flash devices to support general-purpose matrix multiplication. Our analysis indicates that it's very robust to use "billions" of memory cells with modest 4-level and large-spacing analog Icell to produce good accuracy and reliability, contrary to the past thinking to pursue many levels in each memory cell that inevitably suffers accuracy loss. We estimate that a 2.7Gb 3D NOR GEMM can provide a high-performance (frame/sec>300) image recognition inference of ResNet-50 on ImageNet dataset, using a simple flexible controller chip with 1MB SRAM without the need of massive ALU and external DRAM. The accuracy can maintain ~85% for Cifar-10 (by VGG7), and ~90% for ImageNet Top-5 (by ResNet-50) under good device control. This 3D NOR GEMM enjoys much lower system cost and flexibility than a complicated SOC. We also propose an operation and design method of "Cosine Similarity" computing using the 3D NOR. We can use a ternary similarity search algorithm with positive and negative inputs and weights to perform high-dimension feature vector (such as 512 for face recognition with FaceNet on VGGFace2 dataset) similarity computing in a high-parallelism CIM design (512 WL inputs, 1024 BL's, at Tread=100ns). High-accuracy search (~97.8%, almost identical to 98% of software computing) and high internal search bandwidth (~5Tb/s per chip) are achieved. This in-Flash search accelerator is potential to enable new hardware-aware search algorithms in big data retrieval applications.
ISSN:2156-017X
DOI:10.1109/IEDM45625.2022.10019495