Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large-message communication of GPU data between GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize the updated model parameters among all the GPUs. However, for state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to overburden training performance due to the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to relieve pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, provides a new perspective for optimizing broadcast performance on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated at both the microbenchmark and application levels. At the microbenchmark level, it reduces broadcast communication latency by up to 80.9% compared to a baseline using a state-of-the-art MPI library, and by up to 55.1% compared to existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while maintaining similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
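
The chunked-chain broadcast with collective-level online compression described in the abstract can be pictured as follows: the root compresses each chunk of the message once, and the compressed payload is then forwarded hop by hop along a chain of ranks, so later chunks pipeline behind earlier ones. The sketch below is only a minimal illustration of that idea under simplifying assumptions, not the paper's implementation: it uses host buffers instead of GPU buffers, blocking MPI calls instead of overlapped nonblocking ones, and an identity "codec" in place of a real GPU compression library; `chunked_chain_bcast`, `compress_chunk`, and `decompress_chunk` are hypothetical names.

```c
/*
 * Minimal sketch (not the authors' implementation) of a chunked-chain
 * broadcast with online compression: the root compresses each chunk once,
 * and intermediate ranks forward the compressed payload without
 * recompressing it. Host buffers, blocking MPI calls, and an identity
 * "codec" keep the example self-contained; the paper's design operates on
 * GPU buffers with GPU-aware MPI and a GPU compression library.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder codec: an identity copy standing in for a GPU compression
 * library (a real design would launch compression kernels here). */
static size_t compress_chunk(const char *src, size_t n, char *dst)
{
    memcpy(dst, src, n);
    return n;                       /* "compressed" size */
}
static void decompress_chunk(const char *src, size_t cn, char *dst)
{
    memcpy(dst, src, cn);
}

/* Chain broadcast of `count` bytes: root -> root+1 -> ... -> root+size-1
 * (ranks taken modulo size). Chunks pipeline along the chain, so the
 * per-chunk compression cost at the root overlaps with forwarding. */
static void chunked_chain_bcast(char *buf, size_t count, int root, MPI_Comm comm)
{
    const size_t CHUNK = 1 << 20;   /* 1 MiB chunks; tunable */
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int pos   = (rank - root + size) % size;                   /* position in chain */
    int left  = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
    int right = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

    char *scratch = malloc(CHUNK);  /* holds one compressed chunk */

    for (size_t off = 0; off < count; off += CHUNK) {
        size_t n  = (count - off < CHUNK) ? (count - off) : CHUNK;
        size_t cn = 0;

        if (left == MPI_PROC_NULL) {
            /* Root: compress this chunk once. */
            cn = compress_chunk(buf + off, n, scratch);
        } else {
            /* Non-root: receive the compressed chunk, decompress locally. */
            MPI_Status st;
            int recvd = 0;
            MPI_Recv(scratch, (int)CHUNK, MPI_BYTE, left, 0, comm, &st);
            MPI_Get_count(&st, MPI_BYTE, &recvd);
            cn = (size_t)recvd;
            decompress_chunk(scratch, cn, buf + off);
        }
        if (right != MPI_PROC_NULL) {
            /* Forward the already-compressed payload to the next rank. */
            MPI_Send(scratch, (int)cn, MPI_BYTE, right, 0, comm);
        }
    }
    free(scratch);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t count = 8 * (1 << 20);   /* 8 MiB message, e.g. a parameter shard */
    char *buf = malloc(count);
    if (rank == 0) memset(buf, 42, count);   /* root holds the data to broadcast */

    chunked_chain_bcast(buf, count, 0, MPI_COMM_WORLD);

    printf("rank %d: buf[0]=%d buf[last]=%d\n", rank, buf[0], buf[count - 1]);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Built and run with, e.g., `mpicc chain_bcast.c -o chain_bcast && mpirun -np 4 ./chain_bcast`, every rank ends up with the root's data; with a real GPU codec, only compressed bytes would cross the inter-node interconnect.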

Bibliographic Details
Main Authors: Zhou, Qinghua; Anthony, Quentin; Shafi, Aamir; Subramoni, Hari; Panda, Dhabaleswar K. (DK)
Format: Conference Proceeding
Language: English
Subjects: Benchmark testing; Broadcast; Compression; Computational modeling; Deep learning; GPU-Aware MPI; Graphics processing units; High performance computing; Libraries; Training
Online Access: Request full text
cited_by
cites
container_end_page 31
container_issue
container_start_page 22
container_title
container_volume
creator Zhou, Qinghua
Anthony, Quentin
Shafi, Aamir
Subramoni, Hari
Panda, Dhabaleswar K. DK
description With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large-message communication of GPU data between GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize the updated model parameters among all the GPUs. However, for state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to overburden training performance due to the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to relieve pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, provides a new perspective for optimizing broadcast performance on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated at both the microbenchmark and application levels. At the microbenchmark level, it reduces broadcast communication latency by up to 80.9% compared to a baseline using a state-of-the-art MPI library, and by up to 55.1% compared to existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while maintaining similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
doi_str_mv 10.1109/HiPC56025.2022.00016
format conference_proceeding
fullrecord publisher: IEEE; date: 2022-12; eisbn: 9781665494236, 1665494239; coden: IEEPAD; pages: 22-31 (10 pages); link: https://ieeexplore.ieee.org/document/10106309
fulltext fulltext_linktorsrc
identifier EISSN: 2640-0316
ispartof 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2022, p.22-31
issn 2640-0316
language eng
recordid cdi_ieee_primary_10106309
source IEEE Xplore All Conference Series
subjects Benchmark testing
Broadcast
Compression
Computational modeling
Deep learning
GPU-Aware MPI
Graphics processing units
High performance computing
Libraries
Training
title Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T17%3A49%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Accelerating%20Broadcast%20Communication%20with%20GPU%20Compression%20for%20Deep%20Learning%20Workloads&rft.btitle=2022%20IEEE%2029th%20International%20Conference%20on%20High%20Performance%20Computing,%20Data,%20and%20Analytics%20(HiPC)&rft.au=Zhou,%20Qinghua&rft.date=2022-12&rft.spage=22&rft.epage=31&rft.pages=22-31&rft.eissn=2640-0316&rft.coden=IEEPAD&rft_id=info:doi/10.1109/HiPC56025.2022.00016&rft.eisbn=9781665494236&rft.eisbn_list=1665494239&rft_dat=%3Cieee_CHZPO%3E10106309%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i204t-4e1c68a80707b3e869c1d0e23b563d450b35e6eab2299675c214557c32f6e9723%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10106309&rfr_iscdi=true