
StreamMLOps: Operationalizing Online Learning for Big Data Streaming & Real-Time Applications

Bibliographic Details
Main Authors: Barry, Mariam; Montiel, Jacob; Bifet, Albert; Wadkar, Sameer; Manchev, Nikolay; Halford, Max; Chiky, Raja; Jaouhari, Saad EL; Shakman, Katherine B; Fehaily, Joudi Al; Le Deit, Fabrice; Tran, Vinh-Thuy; Guerizec, Eric
Format: Conference Proceeding
Language: English
Description
Summary: Learning continuously from evolving streaming data while serving predictions in real time is a challenging problem. Traditionally, data is partitioned and processed in batches to train machine learning (ML) models. In industrial applications, the performance of static models drops over time (model degradation, concept drift), so new models must be trained on recent data and redeployed in production. The scientific community has been studying online and adaptive methods to address the limitations of batch learning and to train models continuously for industrial applications such as cyber-security, AIOps, anomaly scoring, and drift detection in stock markets. This paper deals with the MLOps aspects of deploying such online, dynamic models to meet the requirements of production systems for real-time applications. Our architectures, based on open-source tools such as Kafka and River, demonstrate how online learning methods can be scaled horizontally in production to meet the demands of a high-velocity streaming pipeline. We demonstrate an MLOps strategy that performs incremental learning from streaming data and continuously deploys the online learning model without pausing the inference pipeline. The design satisfies requirements such as model versioning, monitoring, auditability, and reproducibility of predictions in both supervised and semi-supervised settings. Our experiments on a malicious URL detection task, performed on high-dimensional and feature-evolving streaming data (more than 3 million features), establish the effectiveness and efficiency of online learning models compared to batch (static) machine learning in terms of both time and space complexity. Finally, we provide some best practices on data engineering for deploying online models to process a real-time feature stream in production environments. Code is publicly available for reproducibility.
ISSN: 2375-026X
DOI: 10.1109/ICDE55515.2023.00272
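
The incremental learning idea described in the summary (learn from each record as it arrives, on a feature space that keeps growing) can be illustrated with a minimal sketch. This is not the authors' code and it does not use River or Kafka: it is a plain-Python online logistic regression, chosen so the example runs without dependencies. The class `OnlineLogisticRegression`, the helper `url_features`, the function `progressive_validation`, and the toy URLs are all hypothetical and exist only for illustration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD.

    Weights are stored in a dict keyed by feature name, so a feature that
    first appears mid-stream just gets a new entry; no fixed dimension.
    """

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba_one(self, x):
        # Unseen features contribute 0 and are not added to the model here.
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-35.0, min(35.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single example is (p - y) * x.
        error = self.predict_proba_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error


def url_features(url):
    """Hypothetical extractor: one binary feature per URL token."""
    tokens = url.replace("/", ".").replace("-", ".").split(".")
    return {f"tok={t}": 1.0 for t in tokens if t}


def progressive_validation(stream, model):
    """Test-then-train: score each example before learning from it."""
    correct = 0
    total = 0
    for url, label in stream:
        x = url_features(url)
        predicted = 1 if model.predict_proba_one(x) >= 0.5 else 0
        correct += int(predicted == label)
        total += 1
        model.learn_one(x, label)
    return correct / total if total else 0.0


# Made-up URLs (label 1 = malicious), repeated to mimic a longer stream.
toy_stream = [
    ("login.secure-update.example.test/verify", 1),
    ("docs.example.org/guide", 0),
    ("free-prizes.example.test/claim", 1),
    ("news.example.org/today", 0),
] * 25

model = OnlineLogisticRegression(learning_rate=0.5)
accuracy = progressive_validation(toy_stream, model)
print(f"progressive accuracy: {accuracy:.2f}, features seen: {len(model.weights)}")
```

Because the model is scored on each record before it is updated, the reported accuracy reflects performance on data the model has not yet trained on, which is the usual way to evaluate a learner that never sees a separate held-out batch. In a deployment like the one the summary describes, the toy list would be replaced by a stream consumer and the hand-written model by a library implementation.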