
Narrow big data in a stream: Computational limitations and regression

Bibliographic Details
Published in: Information Sciences 2019-06, Vol. 486, pp. 379–392
Main Author: Černý, Michal
Format: Article
Language: English
Description
Summary: We consider the on-line model for a data stream: data points wait in a queue and are accessible one by one by a special instruction. Once a data point is processed, it is dropped forever. The data stream is assumed to be so long that it cannot be stored in memory in full: the size of memory is assumed to be polynomial in the dimension of the data, but not in the number of observations. This is a natural model for Narrow Big Data. First we prove a negative theorem illustrating that this model leads to serious limitations: we show that some elementary statistics, such as sample quantiles, cannot be computed in this model (the proof is based on a Kolmogorov complexity argument). This raises a crucial question: which data-analytic procedures can be implemented in the stream data model, and which cannot be performed at all, or only approximately (with some loss of information)? After the negative results, we turn our attention to several positive results for multivariate linear regression with Narrow Big Data. We prove that least-squares based estimators and regression diagnostic statistics, such as statistics based on the residual sum of squares, can be computed in this model efficiently. The class of statistics efficiently computable in the stream data model also includes two-stage procedures involving auxiliary regressions, such as White's heteroscedasticity test or the Breusch–Godfrey autocorrelation test (which may be surprising, because these procedures, as defined, seem to require a data point to be processed several times). The computation is done exactly: we do not use preprocessing steps involving data compression techniques with information loss (such as sampling or grouping) to reduce the size of the data set.
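To illustrate the kind of one-pass computation the abstract describes, the following is a minimal sketch (not the paper's exact algorithm) of streaming least squares: only the sufficient statistics XᵀX, Xᵀy, yᵀy and the count n are kept in memory, so storage is polynomial in the data dimension p and independent of the number of observations. The function name `stream_ols` and the toy data generator are illustrative assumptions, not taken from the article.

```python
import numpy as np

def stream_ols(stream, p):
    """One-pass OLS over a data stream of (x, y) pairs with x of dimension p.

    Keeps only XtX (p x p), Xty (p,), yty and n in memory -- each data point
    is read once and then dropped, matching the on-line stream model.
    """
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    yty = 0.0
    n = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        XtX += np.outer(x, x)   # accumulate X'X
        Xty += y * x            # accumulate X'y
        yty += y * y            # accumulate y'y
        n += 1
    beta = np.linalg.solve(XtX, Xty)  # least-squares estimate
    rss = yty - Xty @ beta            # residual sum of squares, from the
                                      # same sufficient statistics
    return beta, rss, n

# Toy usage: noiseless y = 2*x1 + 3*x2, observed one point at a time.
rng = np.random.default_rng(0)

def points():
    for _ in range(1000):
        x = rng.normal(size=2)
        yield x, 2.0 * x[0] + 3.0 * x[1]

beta, rss, n = stream_ols(points(), p=2)
```

The residual sum of squares falls out of the same accumulated statistics, which is why RSS-based diagnostics are also computable in this model without a second pass over the data.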
ISSN: 0020-0255; 1872-6291
DOI: 10.1016/j.ins.2019.02.052