High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems

Bibliographic Details
Main Authors: Chu, Ching-Hsiang, Hashmi, Jahanzeb Maqbool, Khorassani, Kawthar Shafie, Subramoni, Hari, Panda, Dhabaleswar K.
Format: Conference Proceeding
Language: English
Description
Summary: The recent advent of the NVLink interconnect and the Peripheral Component Interconnect Express (PCIe) switch has resulted in the creation of extremely dense Graphics Processing Unit (GPU) systems such as the Cray CS-Storm and NVIDIA DGX. In addition to the extremely high computational capability and communication capacity within a single machine, these systems expose novel capabilities such as performing load-store operations on remote GPU memory across interconnects. While researchers have proposed solutions that take advantage of load-store semantics to provide high-performance datatype processing on CPUs, there exists no scholarly work on how one can orchestrate such high-performance datatype-based communication for GPU-resident data. In this paper, we take up this challenge and propose high-performance, architecture-aware designs for GPU-based non-contiguous datatype processing that use the load-store semantics exposed by modern dense GPU systems. We demonstrate that the proposed solutions reduce the overhead of datatype processing by up to 4.7X compared to state-of-the-art schemes for a GPU-based MILC communication kernel on an NVLink2-enabled dense GPU system. For a weather forecast application kernel, the proposed designs deliver a HaloExchange kernel that is up to 9.9X faster across 64 GPUs than state-of-the-art designs. The proposed adaptive scheme also achieves 10% higher throughput than existing designs for a 2D Jacobi solver on 16 GPUs. To the best of our knowledge, this is the first scholarly work that takes advantage of zero-copy-based load-store semantics to perform high-performance GPU-to-GPU derived datatype communication on modern dense GPU systems.
ISSN:2640-0316
DOI:10.1109/HiPC.2019.00041
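
For context, the communication pattern the abstract targets can be illustrated with a minimal sketch that is not taken from the paper: exchanging a non-contiguous boundary column of a GPU-resident 2D grid through an MPI derived datatype, assuming a CUDA-aware MPI library (e.g., MVAPICH2-GDR) that accepts device pointers. The grid dimensions, variable names, and the choice of MPI_Type_vector here are illustrative assumptions; the paper's contribution concerns how the MPI library internally processes such a datatype for device buffers (packing kernels versus zero-copy load-store), not the application-level code below.

/* Minimal sketch (not from the paper): halo column exchange with an MPI
 * derived datatype on GPU-resident data. Assumes a CUDA-aware MPI library
 * that accepts device pointers, and at least two MPI ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ny = 1024, nx = 1024;   /* local grid, row-major layout */
    double *grid;
    cudaMalloc((void **)&grid, (size_t)ny * nx * sizeof(double));

    /* One column of the grid: ny blocks of 1 double, stride nx doubles.
     * This is the classic non-contiguous layout whose packing/unpacking
     * cost GPU-based datatype processing designs aim to reduce. */
    MPI_Datatype column;
    MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* Exchange boundary columns between ranks 0 and 1; the MPI library
     * handles datatype processing for the device buffers internally. */
    if (size >= 2) {
        if (rank == 0)
            MPI_Send(grid + nx - 1, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(grid, 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    cudaFree(grid);
    MPI_Finalize();
    return 0;
}

With a conventional scheme, the library would launch a kernel (or staged copies) to pack the strided column into a contiguous buffer before sending; the adaptive designs the abstract describes can instead exploit zero-copy load-store semantics over NVLink/PCIe where the topology allows it.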