Loading…
Analog Computing in Memory (CIM) Technique for General Matrix Multiplication (GEMM) to Support Deep Neural Network (DNN) and Cosine Similarity Search Computing using 3D AND-type NOR Flash Devices
The massively increasing data in computing world inspires the R&D of novel memory-centric computing architectures and devices. In this work, we propose a novel analog CIM technique for GEMM using 3D NOR Flash devices to support general-purpose matrix multiplication. Our analysis indicates that i...
Saved in:
Main Authors: | , , , , , , , , , , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Citations: | Items that cite this one |
Online Access: | Request full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | The massively increasing data in computing world inspires the R&D of novel memory-centric computing architectures and devices. In this work, we propose a novel analog CIM technique for GEMM using 3D NOR Flash devices to support general-purpose matrix multiplication. Our analysis indicates that it's very robust to use "billions" of memory cells with modest 4-level and large-spacing analog Icell to produce good accuracy and reliability, contrary to the past thinking to pursue many levels in each memory cell that inevitably suffers accuracy loss. We estimate that a 2.7Gb 3D NOR GEMM can provide a high-performance (frame/sec>300) image recognition inference of ResNet-50 on ImageNet dataset, using a simple flexible controller chip with 1MB SRAM without the need of massive ALU and external DRAM. The accuracy can maintain ~85% for Cifar-10 (by VGG7), and ~90% for ImageNet Top-5 (by ResNet-50) under good device control. This 3D NOR GEMM enjoys much lower system cost and flexibility than a complicated SOC. We also propose an operation and design method of "Cosine Similarity" computing using the 3D NOR. We can use a ternary similarity search algorithm with positive and negative inputs and weights to perform high-dimension feature vector (such as 512 for face recognition with FaceNet on VGGFace2 dataset) similarity computing in a high-parallelism CIM design (512 WL inputs, 1024 BL's, at Tread=100ns). High-accuracy search (~97.8%, almost identical to 98% of software computing) and high internal search bandwidth (~5Tb/s per chip) are achieved. This in-Flash search accelerator is potential to enable new hardware-aware search algorithms in big data retrieval applications. |
---|---|
ISSN: | 2156-017X |
DOI: | 10.1109/IEDM45625.2022.10019495 |