Improving the performance of batch schedulers using online job runtime classification
Published in: | Journal of parallel and distributed computing 2022-06, Vol.164, p.83-95 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | • Use of data analytics and machine learning to improve batch scheduling.
• Show that a simple qualitative analysis is sufficient to harness most of the gains.
• Extensive experimental campaign using different traces from various machines and periods.
Job scheduling in high-performance computing platforms is a hard problem that involves uncertainty in both the job arrival process and job execution times. Users typically provide only loose upper bounds for job execution times, which are of limited use to scheduling heuristics based on processing times. Previous studies focused on applying regression techniques to obtain better execution time estimates, which worked reasonably well and improved scheduling metrics. However, these approaches require a long period of training data.
In this work, we propose a simpler approach by classifying jobs as small or large and prioritizing the execution of small jobs over large ones. Indeed, small jobs are the most impacted by queuing delays, but they typically represent a light load and incur a small burden on the other jobs. The classifier operates online and learns by using data collected over the previous weeks, facilitating its deployment and enabling a fast adaptation to changes in the workload characteristics.
We evaluate our approach using four scheduling policies on seven HPC platform workload traces. We show that, first, incorporating such classification reduces the average bounded slowdown of jobs in all scenarios; and second, in most considered scenarios, the improvements are comparable to the ideal hypothetical situation where the scheduler would know in advance the exact running time of jobs. |
---|---|
ISSN: | 0743-7315 1096-0848 |
DOI: | 10.1016/j.jpdc.2022.01.003 |
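
The summary describes two ideas that a small example can make concrete: ordering the queue so that jobs predicted to be small run first, and measuring the result with the bounded slowdown. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation. The one-hour `SMALL_THRESHOLD`, the 10-second bound `TAU`, the `Job` fields, and the per-user majority-vote classifier are all hypothetical stand-ins; the record does not say which features, threshold, or learning method the paper uses. The bounded slowdown follows the definition commonly used in the scheduling literature, max((wait + run) / max(run, τ), 1).

```python
from dataclasses import dataclass

TAU = 10.0                # assumed bound (seconds) for the bounded slowdown
SMALL_THRESHOLD = 3600.0  # hypothetical cut-off (seconds) between small and large


@dataclass
class Job:
    job_id: int
    user: str
    submit_time: float
    runtime: float  # actual runtime, known only once the job has finished


def bounded_slowdown(wait: float, runtime: float, tau: float = TAU) -> float:
    """Commonly used definition: max((wait + run) / max(run, tau), 1)."""
    return max((wait + runtime) / max(runtime, tau), 1.0)


def predict_small(job: Job, history: list[Job],
                  threshold: float = SMALL_THRESHOLD) -> bool:
    """Stand-in classifier: majority vote over the same user's finished jobs.

    `history` is assumed to hold only recently completed jobs (for example,
    the previous weeks), so the prediction follows workload changes.
    """
    past = [j.runtime for j in history if j.user == job.user]
    if not past:
        return False  # no history: conservatively treat the job as large
    small_votes = sum(1 for r in past if r <= threshold)
    return 2 * small_votes > len(past)


def order_queue(queue: list[Job], history: list[Job]) -> list[Job]:
    """Predicted-small jobs first; ties broken by submission time (FCFS)."""
    return sorted(queue,
                  key=lambda j: (not predict_small(j, history), j.submit_time))


history = [Job(1, "alice", 0.0, 60.0),
           Job(2, "alice", 5.0, 120.0),
           Job(3, "bob", 7.0, 7200.0)]
queue = [Job(4, "bob", 10.0, 0.0),
         Job(5, "alice", 20.0, 0.0),
         Job(6, "carol", 30.0, 0.0)]

ordered_ids = [j.job_id for j in order_queue(queue, history)]
print(ordered_ids)
print(bounded_slowdown(wait=100.0, runtime=5.0))
```

In this toy queue, the job from the user whose recent jobs were short moves ahead of the earlier submissions, while the remaining jobs keep their first-come, first-served order. A base policy other than FCFS could be substituted in the sort key.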