Parallelization Strategies for DLRM Embedding Bag Operator on AMD CPUs
Published in: IEEE MICRO, 2024-11, Vol. 44 (6), pp. 44-51
Main Authors:
Format: Article
Language: English
Summary: Deep learning recommendation models (DLRMs) are deployed extensively to support personalized recommendations and consume a large fraction of artificial intelligence (AI) cycles in modern datacenters, with the embedding stage being a critical component. Modern CPUs execute a large share of DLRM cycles because they are cost-effective compared to GPUs and other accelerators. Our paper addresses key bottlenecks in accelerating the embedding stage on CPUs. Specifically, this work 1) explores novel threading schemes that parallelize the embedding bag operator, 2) pushes the envelope on realized bandwidth by improving data reuse in caches, and 3) studies the impact of parallelization on load imbalance. The new embedding bag kernels have been prototyped in the ZenDNN software stack. Put together, our work on fourth-generation EPYC processors achieves up to 9.9x improvement in embedding bag performance over state-of-the-art implementations, and realized bandwidth of up to 5.7x over DDR bandwidth.
ISSN: 0272-1732 (print); 1937-4143 (electronic)
DOI: | 10.1109/MM.2024.3423785 |