
PACER: Accelerating Distributed GNN Training Using Communication-Efficient Partition Refinement and Caching

Bibliographic Details
Published in: Proceedings of the ACM on Networking, 2024-12, Vol. 2 (CoNEXT4), pp. 1-18, Article 39
Main Authors: Mahmud, Shohaib; Shen, Haiying; Iyer, Anand
Format: Article
Language: English
Description
Summary: Despite recent breakthroughs in distributed Graph Neural Network (GNN) training, large-scale graphs still generate significant network communication overhead, decreasing time and resource efficiency. Although recently proposed partitioning and caching methods try to reduce communication inefficiencies and overheads, they are not sufficiently effective because they are agnostic to the sampling pattern. This paper proposes the Pipelined Partition-Aware Caching and Communication-Efficient Refinement System (Pacer), a communication-efficient distributed GNN training system. First, Pacer estimates each partition's access frequency for each vertex by jointly considering the sampling method and the graph topology. It then uses the estimated access frequencies to refine partitions and to cache vertices in its two-level (CPU and GPU) cache, minimizing data transfer latency. Furthermore, Pacer incorporates a pipeline-based minibatching method to mask the effect of network communication. Experimental results on real-world graphs show that Pacer outperforms state-of-the-art distributed GNN training systems, reducing training time by 40% on average.
ISSN: 2834-5509
DOI: 10.1145/3697805
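
Editor's note: the abstract above describes frequency-estimated, two-level (GPU/CPU) vertex caching. The sketch below is not Pacer's implementation; it is a minimal illustration of the general idea, assuming a hypothetical Monte-Carlo neighbor-sampling frequency estimator (`estimate_access_frequency`) and a hypothetical static placement policy (`TwoLevelCache`) with made-up capacities.

```python
# Illustrative sketch only: rank vertices by an estimated sampling-time access
# frequency, keep the hottest vertices in a small "GPU" tier, the next-hottest
# in a larger "CPU" tier, and treat everything else as a remote fetch.
from collections import Counter
import random


def estimate_access_frequency(adjacency, local_seeds, fanout=5, hops=2, trials=200):
    """Monte-Carlo estimate of how often each vertex is touched when
    neighbor-sampled minibatches are drawn from this partition's seed set."""
    counts = Counter()
    for _ in range(trials):
        frontier = [random.choice(local_seeds)]
        for _ in range(hops):
            nxt = []
            for v in frontier:
                nbrs = adjacency.get(v, [])
                sampled = random.sample(nbrs, min(fanout, len(nbrs)))
                counts.update(sampled)
                nxt.extend(sampled)
            frontier = nxt
    return counts


class TwoLevelCache:
    """Static two-level placement: hottest vertices on GPU, next tier in CPU RAM."""

    def __init__(self, freq, gpu_capacity, cpu_capacity):
        ranked = [v for v, _ in freq.most_common()]
        self.gpu = set(ranked[:gpu_capacity])
        self.cpu = set(ranked[gpu_capacity:gpu_capacity + cpu_capacity])

    def locate(self, vertex):
        if vertex in self.gpu:
            return "gpu"
        if vertex in self.cpu:
            return "cpu"
        return "remote"  # would require a network fetch during training


if __name__ == "__main__":
    # Toy ring graph; vertices 0-49 are this partition's local training seeds.
    adjacency = {v: [(v - 1) % 100, (v + 1) % 100] for v in range(100)}
    freq = estimate_access_frequency(adjacency, local_seeds=list(range(50)))
    cache = TwoLevelCache(freq, gpu_capacity=10, cpu_capacity=30)
    print({tier: sum(cache.locate(v) == tier for v in range(100))
           for tier in ("gpu", "cpu", "remote")})
```

In this toy setup, vertices adjacent to the local seeds dominate the estimated counts and therefore land in the GPU or CPU tiers, which is the intuition behind placing frequently sampled vertices closer to the trainer.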