
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory


Bibliographic Details
Published in: arXiv.org, 2024-03
Main Authors: Hong, Jeongmin; Cho, Sungjun; Park, Geonwoo; Yang, Wonhyuk; Gong, Young-Ho; Kim, Gwangsun
Format: Article
Language: English
Description:

We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM can for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache must be carefully designed to address the latency and bandwidth (BW) limitations of SCM while minimizing cost overhead and accounting for the GPU's characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of GPU memory accesses to bypass DRAM for data with low performance utility.

In addition, to reduce DRAM cache probes and increase effective DRAM BW at minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to hold DRAM cacheline tags; the L2 capacity devoted to the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. AMIL also retains full ECC protection, unlike the Tag-And-Data (TAD) organization of prior DRAM caches.

Additionally, we propose SCM throttling to curtail power and exploit SCM's SLC/MLC modes to adapt to a workload's memory footprint. While our techniques apply to different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
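The probe-reduction idea described above can be illustrated with a toy simulation: a small LRU tag cache (standing in for the CTC carved out of L2) answers most DRAM-cache tag lookups, so an actual DRAM tag probe is only needed on a tag-cache miss. This is a minimal sketch under assumed simplifications (direct-mapped DRAM cache, per-set tag entries); the class names and parameters are illustrative, not the paper's implementation.

```python
from collections import OrderedDict

class TagCache:
    """LRU cache of DRAM-cache tags (stands in for the CTC in L2)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # set_index -> tag currently in DRAM cache

    def lookup(self, set_index):
        if set_index in self.entries:
            self.entries.move_to_end(set_index)  # refresh LRU position
            return self.entries[set_index]
        return None

    def fill(self, set_index, tag):
        self.entries[set_index] = tag
        self.entries.move_to_end(set_index)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least-recently used

class DramCache:
    """Direct-mapped DRAM cache; each real tag probe costs a DRAM access."""
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [None] * num_sets
        self.probes = 0  # count of actual DRAM tag probes

    def access(self, addr, tag_cache):
        set_index = addr % self.num_sets
        tag = addr // self.num_sets
        cached_tag = tag_cache.lookup(set_index)
        if cached_tag is not None:
            return cached_tag == tag  # hit/miss resolved without a DRAM probe
        self.probes += 1  # tag-cache miss: must probe DRAM for the tag
        hit = self.tags[set_index] == tag
        if not hit:
            self.tags[set_index] = tag  # fill the DRAM cache on a miss
        tag_cache.fill(set_index, self.tags[set_index])
        return hit
```

With any access stream that has reuse, repeated lookups to the same set are answered from the tag cache, so `probes` grows much more slowly than the access count, which is the effect the CTC aims for.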
DOI: 10.48550/arxiv.2403.09358
EISSN: 2331-8422
Source: Publicly Available Content (ProQuest)
Subjects: Dynamic random access memory; Performance enhancement; Tags; Throttling; Workload