Loading…

Adaptive key partitioning in distributed stream processing

In stream processing systems, Key Grouping is a commonly employed partitioning scheme for distributing input tuples among parallel instances of stateful operators. With key grouping, tuples shared public keys in the stream are designated to the specific instance responsible for that key. Typically,...

Full description

Saved in:
Bibliographic Details
Published in:CCF transactions on high performance computing (Online) 2024-04, Vol.6 (2), p.164-178
Main Authors: Liu, Gang, Wang, Zeting, Zhou, Amelie Chi, Mao, Rui
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:In stream processing systems, Key Grouping is a commonly employed partitioning scheme for distributing input tuples among parallel instances of stateful operators. With key grouping, tuples shared public keys in the stream are designated to the specific instance responsible for that key. Typically, the implementation of key grouping involves the use of a hash function. While it is convenient and deterministic, it is also known to cause load imbalance between parallel instances, especially in the presence of skewed data streams. Key-Splitting is an effective technique that distributes tasks associated with keys to downstream operators, facilitating load balancing at a relatively low cost. However, overly increasing parallel instances can lead to excessive aggregation costs, becoming a system bottleneck. In this paper, we show the high aggregation cost brought by the Key-Splitting partitioner at different levels of key separation. To address this challenge, we introduce an adaptive Key-Splitting method which controlling the degree of key separation. We propose a partitioner named FlexD, which aims to achieve dynamic adaptation of key separation limits for streaming data. The partitioner employs key grouping to distribute rare keys and dynamic expansion of processing instances to distribute hot keys. We implemented our method on Apache Storm and evaluated it by using real-world and synthetic datasets. Experimental results show that our method achieves a good balance between load balancing and aggregation cost. Moreover, it outperforms existing methods, achieving higher throughput.
ISSN:2524-4922
2524-4930
DOI:10.1007/s42514-023-00179-3