
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory


Bibliographic Details
Main Authors: Hong, Jeongmin; Cho, Sungjun; Park, Geonwoo; Yang, Wonhyuk; Gong, Young-Ho; Kim, Gwangsun
Format: Conference Proceeding
Language: English
Description
Summary: We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that mandate memory oversubscription, resulting in substantial speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth limitations of the SCM while minimizing cost overhead and considering the GPU's characteristics. Because the massive number of GPU threads can easily thrash the DRAM cache and degrade performance, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of memory accesses by GPUs with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probe traffic and increase effective DRAM bandwidth with minimal cost overhead, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. The L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. The AMIL also retains full ECC protection, unlike prior DRAM cache implementations with Tag-And-Data (TAD) organization. Additionally, we propose SCM throttling to curtail power consumption and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques can be used for different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, the HMS improves performance by up to 12.5× (2.9× overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
ISSN:2378-203X
DOI:10.1109/HPCA57654.2024.00021