
OSM: Off-Chip Shared Memory for GPUs


Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2022-12, Vol. 33 (12), p. 1-1
Main Authors: Darabi, Sina; Yousefzadeh-Asl-Miandoab, Ehsan; Akbarzadeh, Negar; Falahati, Hajar; Lotfi-Kamran, Pejman; Sadrosadati, Mohammad; Sarbazi-Azad, Hamid
Format: Article
Language: English
Description
Summary: Graphics Processing Units (GPUs) employ a shared memory, a software-managed cache for programmers, in each streaming multiprocessor to accelerate data sharing among the threads in a thread block. Although, on average, 60% of the shared memory space is underutilized, some workloads demand higher shared memory capacities. Improving shared memory utilization while satisfying the needs of shared-memory-intensive workloads is therefore challenging. We make a key observation that the lifetime of each shared memory address is significantly shorter than the execution time of a thread block. In this paper, we propose Off-Chip Shared Memory (OSM), which allocates shared memory space in the off-chip memory and accelerates accesses to it via a small on-chip cache. Using an 8KB cache for shared memory addresses, OSM provides almost the same performance as the baseline GPU that uses 96KB on-chip shared memory. OSM improves GPU performance in two ways. First, it allocates higher shared memory capacities in the off-chip memory, which improves thread-level parallelism (TLP). Second, it employs a unified cache for the shared memory and global address spaces, providing more caching space for the global memory address space even for workloads with high shared memory utilization. Our experimental results show average IPC improvements of 21% and 18% over the baseline and the state-of-the-art architectures, respectively.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2022.3154315
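
Illustration: The key observation in the Summary (each shared memory address is live for much less time than the thread block runs) can be illustrated with a toy cache simulation. The sketch below is not the OSM design from the article. It is a minimal model under stated assumptions: a fully associative LRU cache, a 128-byte line size (so 96KB is 768 lines and 8KB is 64 lines), and a synthetic access trace in which each group of lines is reused a few times and then never touched again. The names LRUCache, phased_trace, and hit_rate, and all parameter values, are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache over line addresses (hypothetical model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # insertion order tracks recency of use
        self.hits = 0
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)  # mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def phased_trace(total_lines, live_lines, reuses):
    """Yield line addresses in phases: each window of `live_lines` lines is
    swept `reuses` times, then never touched again (a short address lifetime)."""
    for start in range(0, total_lines, live_lines):
        window = range(start, min(start + live_lines, total_lines))
        for _ in range(reuses):
            yield from window


def hit_rate(capacity_lines, total_lines=768, live_lines=32, reuses=8):
    """Hit rate of an LRU cache of `capacity_lines` on the synthetic trace.

    Assumed defaults: 768 lines of 128 bytes (a 96KB footprint), of which
    only 32 lines (4KB) are live at any one time.
    """
    cache = LRUCache(capacity_lines)
    for line in phased_trace(total_lines, live_lines, reuses):
        cache.access(line)
    return cache.hits / (cache.hits + cache.misses)


if __name__ == "__main__":
    for lines in (16, 64, 768):
        print(f"{lines * 128 // 1024:>3} KB cache: hit rate {hit_rate(lines):.3f}")
```

Under these assumed parameters, the 64-line (8KB) cache reaches the same hit rate as a 768-line (96KB) one, because only cold misses remain once the cache covers the live window. A 16-line cache, smaller than the live window, thrashes. Real shared memory access patterns are workload dependent, so this only shows why a short address lifetime makes a small cache plausible. It does not reproduce the article's results.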