TNP: A Step Towards Elastic Training

Bibliographic Details
Main Authors: Yeng, Li-Chung; Lee, Wei-Tsong; Wei, Hsin-Wen
Format: Conference Proceeding
Language: English
Description
Summary: As machine learning models grow ever larger and GPUs are released on short cycles, hardware quickly becomes outdated. To cope with ever-growing model sizes, we seek ways to better utilize the computing power we already possess. This paper implements a makespan-aware distributed training framework called Train 'N' Play (TNP) that makes training large models on large datasets possible on systems that otherwise could not accomplish it.
ISSN: 2575-8284
DOI: 10.1109/ICCE-Taiwan58799.2023.10226742