Loading…

Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Most often, in real systems, failure rates are increasing or decreasing over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling prob...

Full description

Saved in:
Bibliographic Details
Published in:Journal of parallel and distributed computing 2014-05, Vol.74 (5), p.2411-2422
Main Authors: Bouguerra, Mohamed Slim, Kondo, Derrick, Mendonca, Fernando, Trystram, Denis
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Most often, in real systems, failure rates are increasing or decreasing over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem of optimizing application makespan and reliability. In particular, we determine whether one can optimize both makespan and reliability simultaneously, or whether one metric must be degraded in order to improve the other. We also devise scheduling algorithms for achieving (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that one can optimize both makespan and reliability simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. The implications of our findings are the accelerated completion and improved reliability of parallel jobs executed across large distributed systems. Finally, we conduct simulations to investigate the impact of failures on the performance, which is done using an actual application of biological sequence comparison. •If failure rates are decreasing, we prove that makespan and reliability are antagonistic.•As there is no single optimal solution, we design an algorithm for computing the approximation set of the Pareto front.•If failure rates are increasing, both makespan and reliability can be optimized at the same time.•We prove that the scheduling problem can be solved optimally for several common failure distributions, such as Weibull.
ISSN:0743-7315
1096-0848
DOI:10.1016/j.jpdc.2014.01.005