Parallelization Strategies for DLRM Embedding Bag Operator on AMD CPUs
Published in: IEEE MICRO, 2024-11, Vol. 44 (6), pp. 44-51
Main Authors:
Format: Article
Language: English
Summary: Deep learning recommendation models (DLRMs) are deployed extensively to support personalized recommendations and consume a large fraction of artificial intelligence (AI) cycles in modern datacenters, with the embedding stage being a critical component. Modern CPUs execute a large share of DLRM cycles because they are cost-effective compared to GPUs and other accelerators. Our paper addresses key bottlenecks in accelerating the embedding stage on CPUs. Specifically, this work 1) explores novel threading schemes that parallelize the embedding bag operator, 2) pushes the envelope on realized bandwidth by improving data reuse in caches, and 3) studies the impact of parallelization on load imbalance. The new embedding bag kernels have been prototyped in the ZenDNN software stack. Put together, our work on fourth-generation EPYC processors achieves up to 9.9x improvement in embedding bag performance over state-of-the-art implementations, and realized bandwidth of up to 5.7x over DDR bandwidth.
ISSN: 0272-1732 (print); 1937-4143 (electronic)
DOI: | 10.1109/MM.2024.3423785 |