
Improving the performance of batch schedulers using online job runtime classification

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, 2022-06, Vol. 164, p. 83-95
Main Authors: Zrigui, Salah; de Camargo, Raphael Y.; Legrand, Arnaud; Trystram, Denis
Format: Article
Language: English
Description
Summary:
• Use of data analytics and machine learning to improve batch scheduling.
• Show that a simple qualitative analysis is sufficient to harness most of the gains.
• Extensive experimental campaign using different traces from various machines and periods.

Job scheduling in high-performance computing platforms is a hard problem that involves uncertainty in both the job arrival process and job execution times. Users typically provide only loose upper bounds on job execution times, which are of little use to scheduling heuristics based on processing times. Previous studies applied regression techniques to obtain better execution time estimates, which worked reasonably well and improved scheduling metrics. However, these approaches require a long period of training data. In this work, we propose a simpler approach: classify jobs as small or large and prioritize the execution of small jobs over large ones. Small jobs are the most affected by queuing delays, yet they typically represent a light load and impose only a small burden on the other jobs. The classifier operates online and learns from data collected over the previous weeks, which facilitates its deployment and enables fast adaptation to changes in workload characteristics. We evaluate our approach using four scheduling policies on seven HPC platform workload traces. We show that, first, incorporating such classification reduces the average bounded slowdown of jobs in all scenarios and, second, in most of the scenarios considered, the improvements are comparable to those of the ideal hypothetical situation in which the scheduler knows the exact running time of jobs in advance.
ISSN: 0743-7315, 1096-0848
DOI: 10.1016/j.jpdc.2022.01.003
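
The two ideas named in the summary above, prioritizing jobs classified as small and measuring bounded slowdown, can be illustrated with a minimal sketch. It is not the authors' implementation. The `Job` fields, the `order_queue` helper, the first-come-first-served tie-break within each class, and the threshold `tau = 10` seconds are all assumptions made for illustration. The bounded slowdown formula is the commonly used definition, max((wait + run) / max(run, tau), 1), and the `predicted_small` flag stands in for the output of a hypothetical online classifier.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: float      # seconds since the start of the trace
    predicted_small: bool   # label from a hypothetical online classifier


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Commonly used definition: max((wait + run) / max(run, tau), 1).

    tau (assumed here to be 10 s) keeps very short jobs from
    dominating the average.
    """
    return max((wait + run) / max(run, tau), 1.0)


def order_queue(jobs):
    """Put jobs predicted small first; keep submission order within a class.

    The sort key is (is_large, submit_time), so False (small) sorts
    before True (large).
    """
    return sorted(jobs, key=lambda j: (not j.predicted_small, j.submit_time))


# Illustrative queue, not taken from any real trace.
queue = [
    Job("a", 0.0, False),
    Job("b", 5.0, True),
    Job("c", 7.0, False),
    Job("d", 9.0, True),
]

print([j.name for j in order_queue(queue)])   # ['b', 'd', 'a', 'c']
print(bounded_slowdown(wait=90.0, run=10.0))  # 10.0
```

In this toy queue, a 10-second job that waits 90 seconds has a bounded slowdown of 10, which shows why queuing delays weigh most heavily on small jobs and why moving them ahead can lower the average.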