Eff-ECC: Protecting GPGPUs Register File With a Unified Energy-Efficient ECC Mechanism

Bibliographic Details
Published in:IEEE transactions on computer-aided design of integrated circuits and systems 2022-07, Vol.41 (7), p.2080-2093
Main Authors: Yue, Hengshan, Wei, Xiaohui, Tan, Jingweijia, Jiang, Nan, Qiu, Meikang
Format: Article
Language:English
Description
Summary:Graphics processing units (GPUs) are widely used in general-purpose high-performance computing applications (i.e., GPGPUs), which require reliable execution in the presence of soft errors. To support massive thread-level parallelism, GPUs adopt a sizeable register file, which is highly vulnerable to soft errors. Although modern commercial GPUs provide single-error-correction double-error-detection (SEC-DED) error correction code (ECC) for the register file, it consumes a considerable amount of energy due to frequent register accesses and the leakage power of ECC storage. In this article, we propose to leverage the error sensitivity of instructions, the duplicate characteristics of same-named registers, and the error sensitivity of data bits to build a unified energy-efficient ECC mechanism for the GPGPU register file (Eff-ECC), which consists of instruction-aware ECC (IA-ECC), duplication-aware ECC (DA-ECC), and bit-aware ECC (BA-ECC). Considering the error sensitivity of instructions, IA-ECC implements ECC only for the write registers of critical instructions. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values. Leveraging the inherent error tolerance of programs, BA-ECC protects only the significant bits of registers to guard against critical errors. Experimental results demonstrate that Eff-ECC reduces the energy consumption of traditional SEC-DED ECC by 86.46%.
ISSN:0278-0070
1937-4151
DOI:10.1109/TCAD.2021.3104529
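
Illustrative note: the duplication-aware idea in the summary (skip ECC generation when same-named registers across threads hold the same data) can be sketched roughly as follows. This is not the authors' implementation. The warp size of 32, the function names, the all-lanes-equal check, and the single parity bit standing in for a real SEC-DED encoder are all assumptions made only for illustration.

```python
from typing import Callable, List, Tuple

WARP_SIZE = 32  # assumed number of threads (lanes) sharing one warp-wide register


def parity_code(value: int) -> int:
    """Hypothetical stand-in for an SEC-DED encoder: one even-parity bit over 32 bits."""
    return bin(value & 0xFFFFFFFF).count("1") & 1


def encode_warp_register(
    lanes: List[int], encoder: Callable[[int], int] = parity_code
) -> Tuple[List[int], int]:
    """Return (per-lane check codes, number of encoder invocations).

    If every lane holds the same value, the code is computed once and reused,
    which is the saving a duplication-aware scheme is after.
    """
    if len(set(lanes)) == 1:
        code = encoder(lanes[0])
        return [code] * len(lanes), 1
    # Lanes differ: fall back to encoding each lane separately.
    return [encoder(v) for v in lanes], len(lanes)


if __name__ == "__main__":
    uniform = [7] * WARP_SIZE          # every thread holds the same value
    divergent = list(range(WARP_SIZE)) # every thread holds a different value
    print(encode_warp_register(uniform)[1])
    print(encode_warp_register(divergent)[1])
```

In this sketch the second element of the returned tuple counts encoder invocations, a crude proxy for ECC generation energy. The article's actual detection logic and energy model are not described in this record.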