Parallelizing Checkpoint for Faster Fault Tolerance

Bibliographic Details
Published in: Journal of Physics: Conference Series, 2018-08, Vol. 1069 (1), p. 012061
Main Authors: Wang, Ruibo, Zhang, Wenzhe
Format: Article
Language:English
Description
Summary: Modern systems are prone to errors, which calls for fault tolerance mechanisms. Traditional fault tolerance mechanisms (checkpoint mechanisms) introduce large overhead that is sometimes unacceptable. This paper introduces parallel checkpoint, a highly efficient checkpoint mechanism for fault tolerance in multi-threaded programs. By eliminating the global barrier, parallelizing the threads' checkpoint phases, and overlapping the threads' computing and checkpoint phases, we achieve a substantial performance gain (3.16x on average) and much better scalability than previous checkpoint mechanisms.
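The summary names three ideas: no global barrier, per-thread checkpoint phases running in parallel, and computation overlapping with checkpointing. The Python sketch below illustrates how those ideas can fit together in a toy setting; it is not the authors' implementation. `CheckpointStore`, `worker`, `run`, the toy workload, and the in-memory store (standing in for stable storage) are all hypothetical. Each thread snapshots only its own state on its own schedule and hands the write to a background pool, so it resumes computing immediately.

```python
import copy
import threading
from concurrent.futures import ThreadPoolExecutor


class CheckpointStore:
    """Hypothetical in-memory stand-in for stable storage."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = {}  # (thread_id, epoch) -> snapshot dict

    def write(self, tid, epoch, snapshot):
        with self._lock:
            self._snapshots[(tid, epoch)] = snapshot

    def latest(self, tid):
        """Return the most recent snapshot for one thread, or None."""
        with self._lock:
            epochs = [e for (t, e) in self._snapshots if t == tid]
            return self._snapshots[(tid, max(epochs))] if epochs else None


def worker(tid, steps, interval, store, writer):
    state = {"step": 0, "acc": 0}
    pending = []
    for step in range(1, steps + 1):
        # Compute phase: a toy workload standing in for real work.
        state["acc"] += step * (tid + 1)
        state["step"] = step
        if step % interval == 0:
            # Checkpoint phase for this thread only: there is no global
            # barrier, so other threads keep computing independently.
            snapshot = copy.deepcopy(state)
            # Persist asynchronously so this thread's next compute phase
            # overlaps with writing its previous snapshot.
            pending.append(writer.submit(store.write, tid, step // interval, snapshot))
    for fut in pending:
        fut.result()  # propagate any write failure
    return state


def run(num_threads=4, steps=100, interval=10):
    store = CheckpointStore()
    with ThreadPoolExecutor(max_workers=num_threads) as compute, \
            ThreadPoolExecutor(max_workers=num_threads) as writer:
        futures = [compute.submit(worker, tid, steps, interval, store, writer)
                   for tid in range(num_threads)]
        finals = [f.result() for f in futures]
    return finals, store
```

Calling `finals, store = run()` leaves each thread's ten snapshots in `store`, and `store.latest(tid)` returns the state from which that thread alone could be restarted. Only the local copy is taken synchronously; a real system might use a cheaper copy-on-write snapshot instead. Because of CPython's global interpreter lock, this sketch shows the structure of the technique rather than real parallel speedup.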
ISSN: 1742-6588, 1742-6596
DOI: 10.1088/1742-6596/1069/1/012061