GraphCC: A practical graph learning-based approach to Congestion Control in datacenters

Bibliographic Details
Published in: Computer networks (Amsterdam, Netherlands : 1999), 2025-02, Vol.257, Article 110981
Main Authors: Bernárdez, Guillermo; Suárez-Varela, José; Shi, Xiang; Xiao, Shihan; Cheng, Xiangle; Barlet-Ros, Pere; Cabellos-Aparicio, Albert
Format: Article
Language: English
Description
Summary: Congestion Control (CC) plays a fundamental role in optimizing traffic in Datacenter Networks (DCNs). Currently, DCNs implement two main CC protocols: DCTCP and DCQCN. Both protocols are based on Explicit Congestion Notification (ECN), where switches mark packets when they detect congestion. Nowadays, network experts carefully set ECN parameters to optimize the average network performance. However, today’s DCNs experience rapid and abrupt changes that severely affect the network state (e.g., dynamic workloads, incasts), which leads to under-utilization and sub-optimal performance. In this paper, we present GraphCC, a framework for in-network CC optimization. GraphCC relies on Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNNs), and is compatible with widely deployed ECN-based CC protocols. The proposed solution deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test GraphCC with three real-world traffic workloads, focusing on its capability to accommodate scenarios unseen during training (e.g., traffic changes, failures). We compare GraphCC with a state-of-the-art MARL solution for ECN tuning and observe that our method outperforms this baseline in all evaluation scenarios, with improvements of up to 20% in average Flow Completion Time, similar mean throughput (within 1%), and significant reductions in buffer occupancy (38.0–85.7%).
ISSN: 1389-1286
DOI: 10.1016/j.comnet.2024.110981
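
Illustrative note: the summary says that switches mark packets when they detect congestion and that experts tune ECN parameters by hand. A minimal sketch of what such a marking rule can look like is given below, assuming a RED-style scheme with two queue thresholds and a maximum marking probability. The summary does not specify this scheme, and the names `k_min`, `k_max`, and `p_max` and the function names are hypothetical choices for illustration only. This is not GraphCC's method; it only shows the kind of per-switch setting that a tuning approach might adjust.

```python
import random


def ecn_mark_probability(queue_len, k_min, k_max, p_max):
    """Return the probability of ECN-marking a packet (assumed RED-style rule).

    queue_len: current queue occupancy (any consistent unit, e.g. KB)
    k_min:     below or at this occupancy, nothing is marked (assumed name)
    k_max:     above this occupancy, every packet is marked (assumed name)
    p_max:     marking probability reached as occupancy approaches k_max
    """
    if k_max <= k_min:
        raise ValueError("k_max must be greater than k_min")
    if not 0.0 <= p_max <= 1.0:
        raise ValueError("p_max must lie in [0, 1]")
    if queue_len <= k_min:
        return 0.0
    if queue_len > k_max:
        return 1.0
    # Linear ramp from 0 at k_min up to p_max at k_max.
    return p_max * (queue_len - k_min) / (k_max - k_min)


def should_mark(queue_len, k_min, k_max, p_max, rng=random):
    """Decide randomly whether a single packet gets the ECN mark."""
    return rng.random() < ecn_mark_probability(queue_len, k_min, k_max, p_max)
```

With fixed thresholds, a rule like this reacts the same way regardless of workload, which is consistent with the summary's point that static settings optimized for the average case can underperform when traffic changes abruptly.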