
X-Former: In-Memory Acceleration of Transformers

Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023-08, Vol. 31 (8), p. 1-11
Main Authors: Sridharan, Shrihari, Stevens, Jacob R., Roy, Kaushik, Raghunathan, Anand
Format: Article
Language: English
Summary: Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the self-attention mechanism, which assigns an importance score to every word relative to the other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of dynamic random access memory (DRAM) accesses. Hence, traditional deep neural network (DNN) accelerators such as graphics processing units (GPUs) and tensor processing units (TPUs) face limitations in processing Transformers efficiently. In-memory accelerators based on nonvolatile memory (NVM) promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix-vector multiplications (MVMs) within memory arrays. However, attention score computations, which are used frequently in Transformers (unlike in convolutional neural networks (CNNs) and recurrent neural networks (RNNs)), require MVMs in which both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute Transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence-blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 69.8× and 13× improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU, and up to 24.1× and 7.95× improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
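The abstract's central point is that attention scores require MVMs where both operands are input-dependent, unlike the static-weight MVMs that NVM crossbars handle well. A minimal NumPy sketch (not the paper's implementation; dimensions and names are illustrative assumptions) makes the distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes, not from the paper

# Projection weights are static across inputs -- these MVMs map
# naturally onto NVM crossbar arrays (weights written once).
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))

def attention_scores(x):
    """x: (seq_len, d_model) token embeddings for one input sequence."""
    q = x @ W_q  # static-weight MVM
    k = x @ W_k  # static-weight MVM
    # Q @ K^T multiplies two *dynamic* operands: both depend on x, so an
    # NVM array would need rewriting on every input (high write latency
    # and energy, limited endurance). X-Former assigns such operations
    # to CMOS processing elements instead.
    scores = (q @ k.T) / np.sqrt(d_model)
    # Row-wise softmax turns scores into per-token importance weights.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((seq_len, d_model))
A = attention_scores(x)
print(A.shape)  # (4, 4): one row of importance scores per token
```

Each row of `A` sums to 1, giving the importance of every token relative to the others, as the abstract describes.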
ISSN: 1063-8210, 1557-9999
DOI: 10.1109/TVLSI.2023.3282046