Mixed precision support in HPC applications: What about reliability?
Published in: | Journal of parallel and distributed computing 2023-11, Vol.181, p.104746, Article 104746 |
---|---|
Main Authors: | , , , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | In their quest for exascale and beyond, High-Performance Computing (HPC) systems continue becoming ever larger and more complex. Application developers, on the other hand, leverage novel methods to improve the efficiency of their own codes: a recent trend is the use of floating-point mixed precision, or the careful interlocking of single- and double-precision arithmetic, as a tool to improve performance as well as reduce network and memory boundedness. However, while it is known that modern HPC systems suffer hardware faults at daily rates, the impact of reduced precision on application reliability is yet to be explored. In this work we aim to fill this gap: first, we propose a qualitative survey to identify the branches of HPC where mixed precision is most popular. Second, we show the results of instruction-level fault injection experiments on a variety of representative HPC workloads, comparing vulnerability to Silent Data Errors (SDEs) under different numerical configurations. Our experiments indicate that use of single and mixed precision leads to comparatively more frequent and more severe SDEs, with concerning implications regarding their use on extreme-scale, fault-prone HPC platforms.
• Use of floating-point mixed precision is widespread across all domains of HPC.
• Weather Modeling is at the forefront of mixed precision production use in HPC.
• Use of single and mixed precision leads to a higher Silent Data Error (SDE) rate.
• The severity of observed SDEs is much higher under single and mixed precision. |
---|---|
ISSN: | 0743-7315 1096-0848 |
DOI: | 10.1016/j.jpdc.2023.104746 |