Loading…

A Validation Approach for Quasi-Synchronous Checkpointing Algorithms in HPC Systems

High Performance Computing (HPC) has seen prodigious growth over the past few years and fault tolerance solutions have been incorporated into HPC systems. The most commonly used technique for fault tolerance in HPC is checkpointing. Processes achieve fault tolerance by saving local checkpoint period...

Full description

Saved in:
Bibliographic Details
Main Authors: Khlif, Houda, Kacem, Hatem Hadj, Pomares Hernandez, Saul E., Kacem, Ahmed Hadj
Format: Conference Proceeding
Language:English
Subjects:
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:High Performance Computing (HPC) has seen prodigious growth over the past few years and fault tolerance solutions have been incorporated into HPC systems. The most commonly used technique for fault tolerance in HPC is checkpointing. Processes achieve fault tolerance by saving local checkpoint periodically during execution. When a failure occurs, the previously saved global checkpoint can be used to restart the computation from an intermediate state. Quasi-Synchronous Checkpointing (QSC) is attractive because it allows finding consistent global checkpoints without an extra message overhead. QSC algorithms are classified into: Strictly Z-Path Free, Z-Path Free and Z-Cycle Free. Z-paths and Z-cycles are undesirable patterns that can give rise to inconsistent system states. QSC algorithms are often evaluated with regard to performance, generally through simulation. However, few works have been designed to validate their correctness. Our approach validates if a QSC algorithm is correct by modeling its execution into a graph and then verifying over this graph if the algorithm is exempt from undesirable patterns. QSC is based on the Happened-Before Relation (HBR) introduced by Lamport. One main problem linked to the HBR is the combinatorial state explosion. Nevertheless, for a HPC system modeled with a HBR graph, the computational cost for the identification of such patterns becomes prohibitively high. In this paper, we define a set of transformation rules oriented towards the detection of the undesirable patterns over a graph derived from the causal order set abstraction (CAOS) which is equivalent to a HBR graph, but it drastically reduces the statespace of a system.
ISSN:2161-5330
DOI:10.1109/AICCSA.2017.128