Loading…
Incremental and accurate computation of machine learning models with smart data summarization
Nowadays, data scientists prefer “easy” high-level languages like R and Python, which accomplish complex mathematical tasks with a few lines of code, but they present memory and speed limitations. Data summarization has been a fundamental technique in data mining that has promise with more demanding...
Saved in:
Published in: | Journal of intelligent information systems 2022-08, Vol.59 (1), p.149-172 |
---|---|
Main Authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Nowadays, data scientists prefer “easy” high-level languages like R and Python, which accomplish complex mathematical tasks with a few lines of code, but they present memory and speed limitations. Data summarization has been a fundamental technique in data mining that has promise with more demanding data science applications. Unfortunately, most summarization approaches require reading the entire data set before computing any machine learning (ML) model, the old-fashioned way. Also, it is hard to learn models if there is an addition or removal of data samples. Keeping these motivations in mind, we present incremental algorithms to smartly compute summarization matrix, previously used in parallel DBMSs, to compute ML models incrementally in data science languages. Compared to the previous approaches, our new smart algorithms interleave model computation periodically, as the data set is being summarized. A salient feature is scalability to large data sets, provided the summarization matrix fits in RAM, a reasonable assumption in most cases. We show our incremental approach is intelligent and works for a wide spectrum of ML models. Our experimental evaluation shows models get increasingly accurate, reaching total accuracy when the data set is fully scanned. On the other hand, we show our incremental algorithms are as fast as Python ML library, and much faster than R built-in routines. |
---|---|
ISSN: | 0925-9902 1573-7675 |
DOI: | 10.1007/s10844-021-00690-5 |