TNP: A Step Towards Elastic Training

Bibliographic Details
Main Authors: Yeng, Li-Chung; Lee, Wei-Tsong; Wei, Hsin-Wen
Format: Conference Proceeding
Language: English
Description
Summary: As machine learning models grow ever larger and GPUs are released on short cycles, hardware quickly becomes outdated. To cope with ever-growing model sizes, we seek ways to better utilize the computing power we already possess. This paper implements a makespan-aware distributed training framework called Train 'N' Play (TNP) that makes training large models on large datasets possible on systems that otherwise could not accomplish it.
ISSN: 2575-8284
DOI: 10.1109/ICCE-Taiwan58799.2023.10226742