Fast Optimal Subsampling Probability Approximation for Generalized Linear Models

Bibliographic Details
Published in: Econometrics and Statistics, 2024-01, Vol. 29, p. 224-237
Main Authors: Lee, JooChul, Schifano, Elizabeth D., Wang, HaiYing
Format: Article
Language: English
Description
Summary: For massive data, subsampling techniques are popular for mitigating the computational burden by reducing the data size. In a subsampling approach, subsampling probabilities are specified for each data point to obtain an informative sub-data set, and estimates based on the sub-data are then obtained to approximate the estimates from the full data. Assigning subsampling probabilities by minimizing the asymptotic mean squared error of the estimator from a general subsample (the A-optimality criterion) is a popular approach; however, calculating the probabilities under this setting is still computationally demanding. To efficiently approximate the A-optimal subsampling probabilities for generalized linear models, randomized algorithms are proposed. The algorithms are developed using the Johnson-Lindenstrauss Transform and the Subsampled Randomized Hadamard Transform. Additionally, optimal subsampling probabilities are derived for the Gaussian linear model in the case where both the regression coefficients and the dispersion parameter are of interest, and algorithms are developed to approximate these optimal subsampling probabilities. Simulation studies indicate that the estimators based on the developed algorithms perform very well for statistical inference and yield substantial savings in computing time compared to the direct calculation of the A-optimal subsampling probabilities.
ISSN: 2452-3062
DOI: 10.1016/j.ecosta.2021.02.007
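
The approximation idea mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' algorithm, and the record does not give their formulas. The sketch assumes logistic regression, where A-optimal probabilities are commonly written as proportional to |y_i - p_i| * ||M^{-1} x_i||, and it assumes a plain Gaussian Johnson-Lindenstrauss projection rather than the Subsampled Randomized Hadamard Transform. The function names, the pilot estimate `beta`, and the sketch size `k` are all hypothetical choices made for illustration.

```python
import numpy as np

def probs_exact(X, abs_resid, M):
    """Directly compute pi_i proportional to abs_resid_i * ||M^{-1} x_i||."""
    Z = np.linalg.solve(M, X.T).T            # row i is M^{-1} x_i, cost O(n d^2)
    scores = abs_resid * np.linalg.norm(Z, axis=1)
    return scores / scores.sum()

def probs_jl(X, abs_resid, M, k, rng):
    """Approximate the same probabilities with a k x d Gaussian JL sketch."""
    d = X.shape[1]
    S = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection matrix
    A = np.linalg.solve(M, S.T).T                  # A = S M^{-1}, using M symmetric
    Z = X @ A.T                                    # row i is S M^{-1} x_i, cost O(n d k)
    scores = abs_resid * np.linalg.norm(Z, axis=1)
    return scores / scores.sum()

# Illustrative data; every value below is an assumption, not taken from the article.
rng = np.random.default_rng(0)
n, d, k = 5000, 50, 10
X = rng.standard_normal((n, d))
beta = np.full(d, 0.1)                       # stands in for a pilot estimate
p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted success probabilities
y = rng.binomial(1, p)
M = (X * (p * (1.0 - p))[:, None]).T @ X / n # weighted information-type matrix

pi_exact = probs_exact(X, np.abs(y - p), M)
pi_jl = probs_jl(X, np.abs(y - p), M, k, rng)

# Draw a subsample of size 500 using the approximate probabilities.
idx = rng.choice(n, size=500, replace=True, p=pi_jl)
```

In this sketch the saving comes from replacing the n-by-d product with M^{-1} by an n-by-k product with the smaller matrix S M^{-1}, so it only helps when k is well below d. For simplicity M is formed from the full data here; in practice it would more likely come from a pilot subsample, which is again an assumption rather than something stated in the record.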