Loading…

Collective Communication Performance Evaluation for Distributed Deep Learning Training

In distributed deep learning, the improper use of the collective communication library can lead to a decline in deep learning performance due to increased communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance based on server env...

Full description

Saved in:

Bibliographic Details
Published in:	Applied sciences 2024-06, Vol.14 (12), p.5100
Main Authors:	Lee, Sookwang, Lee, Jaehwan
Format:	Article
Language:	English
Subjects:	cloud virtualization collective communication Communication data parallelism Datasets Deep learning distributed deep learning Experiments Libraries Linux Performance evaluation Workloads
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	In distributed deep learning, the improper use of the collective communication library can lead to a decline in deep learning performance due to increased communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance based on server environment and communication architecture. In this study, we investigate three key aspects to evaluate the performance of the collective communication libraries in a distributed deep learning setting in an intra-node environment. First, we conduct a comparison and analysis of collective communication library performance within common distributed deep learning architectures, such as parameter servers and ring all-reduce methods. Second, we evaluate the performance of these libraries in different environments, including various container platforms and bare metal setups, considering the scalability and flexibility advantages offered by cloud virtualization. Last, to ensure practicality, we assess the libraries’ performance in a Linux shell and within the PyTorch framework. In the cross-docker virtualization environment, NCCL shows up to 213% higher latency compared to single docker, while GLOO exhibits 36% lower latency in single docker than in cross docker, and NCCL achieves up to 345% lower execution time in all-reduce operations compared to other libraries (MPI and GLOO). These findings will inform the selection of an appropriate collective communication library for designing effective distributed deep learning environments.
ISSN:	2076-3417 2076-3417
DOI:	10.3390/app14125100