The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Bibliographic Details
Published in: The International Journal of High Performance Computing Applications, 2005-11, Vol. 19 (4), p. 479-493
Main Authors: Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian; Sahay, Vishal; Lumsdaine, Andrew; Duell, Jason; Hargrove, Paul; Roman, Eric
Format: Article
Language: English
Description
Summary: As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
ISSN: 1094-3420, 1741-2846
DOI: 10.1177/1094342005056139