Loading…

Challenges and experiences in building an efficient apache beam runner for IBM streams

This paper describes the challenges and experiences in the development of IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface for multiple computing engines. Each participating engine implements a runner to translate Beam applications into engine-spec...

Full description

Saved in:
Bibliographic Details
Published in:Proceedings of the VLDB Endowment 2018-08, Vol.11 (12), p.1742-1754
Main Authors: Li, Shen, Gerver, Paul, MacMillan, John, Debrunner, Daniel, Marshall, William, Wu, Kun-Lung
Format: Article
Language:English
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:This paper describes the challenges and experiences in the development of IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface for multiple computing engines. Each participating engine implements a runner to translate Beam applications into engine-specific programs. Hence, applications written with the Beam SDK can be executed on different underlying stream computing engines, with negligible migration penalty. IBM Streams is a widely-used enterprise streaming platform. It has a rich set of connectors and toolkits for easy integration of streaming applications with other enterprise applications. It also supports a broad range of programming language interfaces, including Java, C++, Python, Stream Processing Language (SPL) and Apache Beam. This paper focuses on our solutions to efficiently support the Beam programming abstractions in IBM Streams runner. Beam organizes data into discrete event time windows. This design, on the one hand, supports out-of-order data arrivals, but on the other hand, forces runners to maintain more states, which leads to higher space and computation overhead. IBM Streams runner mitigates this problem by efficiently indexing inter-dependent states, garbage-collecting stale keys, and enforcing bundle sizes. We also share performance concerns in Beam that could potentially impact applications. Evaluations show that IBM Streams runner outperforms Flink runner and Spark runner in most scenarios when running the Beam NEXMark benchmarks. IBM Streams runner is available for download from IBM Cloud Streaming Analytics service console.
ISSN:2150-8097
2150-8097
DOI:10.14778/3229863.3229864