
Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers, 2024-06, Vol. 71 (6), p. 2858-2871
Main Authors: Li, Wenjie, Hu, Aokun, Xu, Ningyi, He, Guanghui
Format: Article
Language: English
Summary: Large language models (LLMs) have sparked a new revolution in the field of natural language processing (NLP), and have garnered tremendous attention in both academic research and everyday life, thanks to their unprecedented performance in a wide range of applications. However, their deployment remains a significant challenge, primarily due to their intensive computational and memory requirements. Hardware acceleration and efficient quantization are promising solutions to address these two issues. In this paper, a quantization and hardware architecture co-design is presented for matrix-vector multiplications (MVMs) of LLMs. During quantization, we uniformly group weights and activations to ensure workload balance for the hardware. To further enhance quantization performance, we propose two approaches, channel sorting and channel selection, which can be applied simultaneously. To support the proposed quantization scheme, we develop two precision-scalable MVM hardware architectures, designed for high speed and high energy efficiency, respectively. Experimental results show that our quantization scheme achieves state-of-the-art performance among all reported post-training schemes that quantize both weights and activations into integers. Compared to the MVM architecture of the state-of-the-art LLM accelerator OliVe, our design exhibits significant advantages in area efficiency and energy efficiency.
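The record itself carries no implementation details, but the group-wise quantization the abstract describes can be illustrated. The NumPy sketch below assumes symmetric uniform integer quantization with equal-size groups and one scale per group; the function names, the group size of 128, and the 4-bit weight / 8-bit activation widths are illustrative assumptions, not the authors' implementation, and the paper's channel sorting and channel selection steps are omitted.

import numpy as np

def quantize_groupwise(x, group_size=128, n_bits=4):
    # Split the vector into equal-size groups and quantize each group
    # symmetrically to n_bits integers with its own scale. Equal group
    # sizes are what keep the per-group hardware workload balanced.
    x = np.asarray(x, dtype=np.float64).reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    absmax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax) / qmax   # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def quantized_mvm(W, a, group_size=128, w_bits=4, a_bits=8):
    # y = W @ a computed as integer dot products per group, then
    # dequantized by the product of the weight and activation scales.
    out_dim, in_dim = W.shape
    assert in_dim % group_size == 0, "input dim must be a multiple of group_size"
    qa, sa = quantize_groupwise(a, group_size, a_bits)   # shared across rows
    y = np.empty(out_dim)
    for r in range(out_dim):
        qw, sw = quantize_groupwise(W[r], group_size, w_bits)
        partial = (qw * qa).sum(axis=1)                  # integer accumulation
        y[r] = float((partial * (sw * sa).ravel()).sum())
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 256))
a = rng.standard_normal(256)
print(np.max(np.abs(quantized_mvm(W, a) - W @ a)))       # small residual error

Because every group has the same size, each integer dot product takes the same number of multiply-accumulate steps, which is the workload-balance property the abstract highlights.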
ISSN: 1549-8328, 1558-0806
DOI: 10.1109/TCSI.2024.3350661