Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy

Bibliographic Details
Published in: IEEE Transactions on Nuclear Science, 2020-07, Vol. 67 (7), p. 1573-1580
Main Authors: Goncalves, Marcio M.; Lamb, Ivan Peter; Rech, Paolo; Brum, Raphael M.; Azambuja, Jose Rodrigo
Format: Article
Language: English
Description
Summary: The high computing power of graphics processing units (GPUs) makes them attractive for safety-critical applications, where reliability is a major concern. This article takes an approximate computing perspective, relaxing application accuracy in order to improve selective fault tolerance techniques. The approach first assesses the vulnerability of a Kepler GPU to transient effects through a neutron beam experiment. It then performs a fault injection campaign to identify the most critical registers and relax the result accuracy. Finally, it uses the acquired data to improve selective fault tolerance techniques in terms of occupation and performance. The results show that relaxing the application accuracy improved the GPU register file's reliability by 71.6% on average and, compared with selective hardening techniques, reduced the number of replicated registers by an average of 41.4% while maintaining 100% fault coverage.
ISSN: 0018-9499, 1558-1578
DOI: 10.1109/TNS.2020.2982162