The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
Published in: | The international journal of high performance computing applications 2005-11, Vol.19 (4), p.479-493 |
---|---|
Main Authors: | , , , , , , , |
Format: | Article |
Language: | English |
Summary: | As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI. |
---|---|
ISSN: | 1094-3420 1741-2846 |
DOI: | 10.1177/1094342005056139 |