
ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration

Bibliographic Details
Main Authors: Yang, Xiaoxuan, Yan, Bonan, Li, Hai, Chen, Yiran
Format: Conference Proceeding
Language: English
Description
Summary: Transformer has emerged as a popular deep neural network (DNN) model for natural language processing (NLP) applications and has demonstrated excellent performance in neural machine translation, entity recognition, and other tasks. However, the scaled dot-product attention mechanism in its auto-regressive decoder creates a performance bottleneck during inference. Transformer is also computationally and memory intensive and demands a hardware acceleration solution. Although researchers have successfully applied ReRAM-based Processing-in-Memory (PIM) to accelerate convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computation process of scaled dot-product attention in Transformer makes it difficult to apply these designs directly. In addition, how to handle intermediate results in matrix-matrix multiplication (MatMul) and how to design a pipeline at a finer granularity of Transformer remain unsolved. In this work, we propose ReTransformer, a ReRAM-based PIM architecture for Transformer acceleration. ReTransformer not only accelerates the scaled dot-product attention of Transformer using ReRAM-based PIM but also eliminates some data dependencies by avoiding writing the intermediate results, using the proposed matrix decomposition technique. Moreover, we propose a new sub-matrix pipeline design for multi-head self-attention. Experimental results show that, compared to GPU and PipeLayer, ReTransformer improves computing efficiency by 23.21× and 3.25×, respectively. The corresponding overall power is reduced by 1086× and 2.82×, respectively.
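For reference, the scaled dot-product attention that the paper targets follows the standard Transformer formulation, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. Below is a minimal NumPy sketch of that computation; it shows the standard algorithm only, not the paper's ReRAM crossbar mapping or its matrix decomposition, and all names and sizes in it are illustrative.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (seq_len, d_k) matrices produced from the input X
        # by learned projections (Q = X @ W_Q, etc.).
        d_k = Q.shape[-1]
        # First MatMul; in a naive ReRAM PIM mapping, this intermediate
        # "scores" matrix would have to be written into the crossbar
        # before the second MatMul can proceed.
        scores = Q @ K.T / np.sqrt(d_k)
        # Row-wise softmax.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # Second MatMul: attention-weighted sum of the values.
        return weights @ V

    # Toy example (illustrative sizes only).
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)  # shape (8, 16)

Unlike CNN/RNN weights, the operands Q, K, and V here are generated dynamically per input, so the intermediate scores matrix cannot be pre-programmed into ReRAM; per the abstract, avoiding the write of such intermediate results is what the proposed matrix decomposition technique addresses.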
ISSN: 1558-2434
DOI: 10.1145/3400302.3415640