Loading…

Analyzing and Leveraging Decoupled L1 Caches in GPUs

Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to address the memory wall. However, they are not always efficiently utilized for optimal GPU performance. We find that the main source of this inefficiency stems from the tightly-coupled design of cores with L1 caches...

Full description

Saved in:

Bibliographic Details
Main Authors:	Ibrahim, Mohamed Assem, Kayiran, Onur, Eckert, Yasuko, Loh, Gabriel H., Jog, Adwait
Format:	Conference Proceeding
Language:	English
Subjects:	Aggregates Bandwidth Computer architecture Energy efficiency GPUs Graphics processing units Locality Network-on-Chip Organizations Traffic control
Online Access:	Request full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to address the memory wall. However, they are not always efficiently utilized for optimal GPU performance. We find that the main source of this inefficiency stems from the tightly-coupled design of cores with L1 caches. First, such a design assumes a per-core private local L1 cache in which each core independently caches the required data. This allows the same cache line to get replicated across cores, which wastes precious cache capacity. Second, due to the many-to-few traffic pattern, the tightly-coupled design leads to low per-core L1 bandwidth utilization while L2/memory is heavily utilized.To address these inefficiencies, we renovate the conventional GPU cache hierarchy by proposing a new DC-L1 (DeCoupled-L1) cache - an L1 cache separated from the GPU core. We show how decoupling the L1 cache from the GPU core provides opportunities to reduce data replication across the L1s and increase their bandwidth utilization. Specifically, we investigate how to aggregate the DC-L1s; how to manage data placement across the aggregated DC-L1s; and how to efficiently connect the DC-L1s to the GPU cores and the L2/memory partitions. Our evaluation shows that our new cache design boosts the useful L1 cache bandwidth and achieves significant improvement in performance and energy efficiency across a wide set of GPGPU applications while reducing the overall NoC area footprint.
ISSN:	2378-203X
DOI:	10.1109/HPCA51647.2021.00047