With great reliability comes great responsibility: tradeoffs of run-time policy on high reliability systems

Bibliographic Details
Main Authors: Kleban, S.D., Johnston, J.R., Ang, J.A., Clearwater, S.H.
Format: Conference Proceeding
Language: English
Description
Summary: In this paper we describe a simulation study to improve performance on a large, highly utilized cluster at Sandia National Laboratories. The unique characteristic of the cluster is that there are very few constraints on job size. In particular, the run-time is limited only by system times, which occur about every two weeks. The major contribution of this paper is that we quantify the difference in makespan between running a single long job and running its equivalent as many shorter jobs. We find that running longer jobs benefits the facility as a whole when cycle-weighted makespans are considered, whereas running shorter jobs improves the unweighted makespan across jobs and the makespan for most users.
DOI: 10.1109/CCGrid.2004.1336653
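
Illustrative note: the summary contrasts unweighted and cycle-weighted makespans. The sketch below shows one way the two averages can diverge. It is not taken from the paper. It assumes that a job's makespan is the time from submission to completion and that "cycles" means nodes multiplied by run time. The `Job` class, the function names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    makespan_hours: float  # assumed: time from submission to completion
    nodes: int             # nodes used by the job
    run_hours: float       # time spent actually running

    @property
    def cycles(self) -> float:
        # Assumed definition of "cycles": node-hours consumed.
        return self.nodes * self.run_hours


def unweighted_mean_makespan(jobs):
    """Every job counts equally, regardless of size."""
    return sum(j.makespan_hours for j in jobs) / len(jobs)


def cycle_weighted_mean_makespan(jobs):
    """Each job counts in proportion to the cycles it consumed."""
    total_cycles = sum(j.cycles for j in jobs)
    return sum(j.makespan_hours * j.cycles for j in jobs) / total_cycles


# Made-up workload: one small short job and one large long job.
jobs = [
    Job(makespan_hours=10.0, nodes=1, run_hours=2.0),
    Job(makespan_hours=100.0, nodes=100, run_hours=50.0),
]

unweighted = unweighted_mean_makespan(jobs)
weighted = cycle_weighted_mean_makespan(jobs)
```

Under these assumptions the large job dominates the cycle-weighted average while the unweighted average treats both jobs alike, which is why a policy can look favorable under one measure and unfavorable under the other.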