Loading…
BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster
Training deep learning model is a time-consuming job since it usually uses a large amount of data. To reduce the training time, most practitioners train the models in a GPU cluster environment in a distributed way. The synchronous stochastic gradient descent, which is one of the widely used distribu...
Saved in:
Published in: | The Journal of supercomputing 2020, Vol.76 (1), p.47-67 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Training deep learning model is a time-consuming job since it usually uses a large amount of data. To reduce the training time, most practitioners train the models in a GPU cluster environment in a distributed way. The synchronous stochastic gradient descent, which is one of the widely used distributed training algorithms, has fast convergence rate with the use of multiple GPU workers; but its speed is tied to the slowest worker, i.e., straggler. In a heterogeneous environment, a static straggler, which has not been mainly focused before, has more impact on the performance than randomly occurring straggler. However, most existing studies for straggler mitigation usually consider a homogeneous environment, so their approaches are limited in practice. In this paper, we scrutinize the straggler problem under heterogeneous environment and define
static
and
dynamic
straggler from empirical results. Based on this, we propose a novel approach called
batch orchestration algorithm (BOA)
for straggler mitigation. It adaptively balances the amount of mini-batch data according to the speed of workers. Therefore, BOA can mitigate both
static
and
dynamic
straggler in a modern GPU cluster. BOA uses a Min–Max Integer programming to find the optimal mini-batch size, with the hardware-agnostic performance models. For verification, several experiments are conducted on a cluster having up to six GPUs with three types: GTX 1080, GTX 1060 and Quadro M2000. The results show BOA mitigates both types of stragglers and accelerates the training speed with synchronous SGD compared to other straggler mitigation method. |
---|---|
ISSN: | 0920-8542 1573-0484 |
DOI: | 10.1007/s11227-019-02845-2 |