TNP: A Step Towards Elastic Training
Main Authors:
Format: Conference Proceeding
Language: English
Summary: With machine learning models continuously growing in size and GPUs on short release cycles, hardware quickly becomes outdated. To cope with ever-growing model sizes, we seek ways to better utilize the computing power we already possess. This paper implements a makespan-aware distributed training framework called Train 'N' Play (TNP) that makes training large models on large datasets possible for systems that otherwise could not accomplish it.
ISSN: 2575-8284
DOI: 10.1109/ICCE-Taiwan58799.2023.10226742