Mixed precision support in HPC applications: What about reliability?

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, 2023-11, Vol. 181, Article 104746
Main Authors: Netti, Alessio; Peng, Yang; Omland, Patrik; Paulitsch, Michael; Parra, Jorge; Espinosa, Gustavo; Agarwal, Udit; Chan, Abraham; Pattabiraman, Karthik
Format: Article
Language: English
Description
Summary: In their quest for exascale and beyond, High-Performance Computing (HPC) systems continue becoming ever larger and more complex. Application developers, on the other hand, leverage novel methods to improve the efficiency of their own codes: a recent trend is the use of floating-point mixed precision, or the careful interlocking of single- and double-precision arithmetic, as a tool to improve performance as well as reduce network and memory boundedness. However, while it is known that modern HPC systems suffer hardware faults at daily rates, the impact of reduced precision on application reliability is yet to be explored. In this work we aim to fill this gap: first, we propose a qualitative survey to identify the branches of HPC where mixed precision is most popular. Second, we show the results of instruction-level fault injection experiments on a variety of representative HPC workloads, comparing vulnerability to Silent Data Errors (SDEs) under different numerical configurations. Our experiments indicate that use of single and mixed precision leads to comparatively more frequent and more severe SDEs, with concerning implications regarding their use on extreme-scale, fault-prone HPC platforms.
• Use of floating-point mixed precision is widespread across all domains of HPC.
• Weather Modeling is at the forefront of mixed precision production use in HPC.
• Use of single and mixed precision leads to a higher Silent Data Error (SDE) rate.
• The severity of observed SDEs is much higher under single and mixed precision.
ISSN: 0743-7315, 1096-0848
DOI: 10.1016/j.jpdc.2023.104746
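
Illustration (not part of the catalog record or the article): the summary compares how single- and double-precision values respond to hardware faults. The Python sketch below is a toy model of that idea, assuming a fault is a single bit flip in the IEEE 754 encoding of one stored value. It is not the instruction-level fault injection used by the authors, and the helper names (`flip_bit`, `significant_flip_fraction`) and the `1e-6` threshold are hypothetical choices made for this example.

```python
import math
import struct

# (float format, unsigned-integer format, bit width) for each precision
_FORMATS = {"single": ("<f", "<I", 32), "double": ("<d", "<Q", 64)}


def flip_bit(value: float, bit: int, precision: str = "double") -> float:
    """Return `value` with one bit of its IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit; the highest bit is the sign.
    """
    float_fmt, int_fmt, width = _FORMATS[precision]
    if not 0 <= bit < width:
        raise ValueError(f"bit must be in [0, {width})")
    (raw,) = struct.unpack(int_fmt, struct.pack(float_fmt, value))
    (flipped,) = struct.unpack(float_fmt, struct.pack(int_fmt, raw ^ (1 << bit)))
    return flipped


def significant_flip_fraction(value: float, precision: str,
                              threshold: float = 1e-6) -> float:
    """Fraction of bit positions whose flip changes `value` noticeably.

    A flip counts as significant if the result is non-finite or its relative
    error exceeds `threshold` (an arbitrary cutoff chosen for illustration).
    Assumes `value` is non-zero.
    """
    float_fmt, _, width = _FORMATS[precision]
    # Round the input to the target precision so the baseline is what is stored.
    (stored,) = struct.unpack(float_fmt, struct.pack(float_fmt, value))
    significant = 0
    for bit in range(width):
        flipped = flip_bit(stored, bit, precision)
        if not math.isfinite(flipped) or abs(flipped - stored) / abs(stored) > threshold:
            significant += 1
    return significant / width


if __name__ == "__main__":
    for precision in ("single", "double"):
        fraction = significant_flip_fraction(1.0, precision)
        print(f"{precision}: {fraction:.3f} of bit flips exceed the threshold")
```

Under this toy model, a single-precision value has fewer low-order mantissa bits to absorb a flip harmlessly, so a larger share of its bit positions yields a visible error. That is only an intuition for the trend the summary reports; real application-level SDE rates depend on the workload and on how errors propagate.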