Fangorn: adaptive execution framework for heterogeneous workloads on shared clusters

Bibliographic Details
Published in: Proceedings of the VLDB Endowment, 2021-08, Vol. 14 (12), p. 2972-2985
Main Authors: Chen, Yingda; Wang, Jiamang; Lu, Yifeng; Han, Ying; Lv, Zhiqiang; Min, Xuebin; Cai, Hua; Zhang, Wei; Fan, Haochuan; Li, Chao; Guan, Tao; Lin, Wei; Jia, Yangqing; Zhou, Jingren
Format: Article
Language: English
Description
Summary: Pervasive needs for data explorations at all scales have populated modern distributed platforms with workloads of different characteristics. The growing complexities and diversities have thereafter imposed distinct challenges to execute them on shared clusters in corporate or public clouds. This paper presents Fangorn, an adaptive execution framework built on an enriched graph model. As the underlying infrastructure for core computation platforms at Alibaba, Fangorn supports various execution modes and caters to heterogeneous workloads. With the capability to orchestrate graph executions with both long-running and requested-on-demand resources at the same time, Fangorn allows exploration of tradeoffs between latency and resource efficiency, for jobs of all scales. By modeling distributed job executions as mutable graphs with pluggable components, Fangorn offers a systematic framework to adjust job executions adaptively, according to data statistics collected during run-time. Fangorn supports an array of different computation engines ranging from relational to deep learning, and is fully deployed on production clusters across Alibaba. It manages tens of millions of distributed jobs daily, with job size scaling from one to half-million.
ISSN: 2150-8097
DOI: 10.14778/3476311.3476376