Probabilistic REpresentatives Mining (PREM): A Clustering Method for Distributional Data Reduction
Published in: AIAA Journal, 2022-04, Vol. 60 (4), pp. 2580-2596
Format: Article
Language: English
Summary: Complex computations and analyses on massive data sets can be impractical or infeasible. Data reduction is a crucial problem in the era of big data: obtaining a reduced representation of a data set facilitates more efficient yet accurate analyses. To best preserve the integrity of the original data set, a reduced representation aims to maintain the same data distribution, and is referred to as a probabilistically representative subset. This paper considers the problem of reducing a large data set to very small subsets of this kind, a regime in which random sampling does not perform well enough. We propose a data mining approach called Probabilistic Representatives Mining (PREM) to tackle this challenge. PREM uses balanced clustering to prevent undersampling and oversampling issues and a multistage computing strategy to achieve better scalability and consistency. Numerical experiments on typical probability distributions and real-world data sets in the field of aeronautics and astronautics demonstrate PREM’s superiority over the baselines. An uncertainty quantification case study from aviation environmental impact modeling further shows PREM’s effectiveness and accuracy in generating probabilistically representative small samples for costly computations. Potential limitations and extensions of the method are also discussed in the paper.
ISSN: 0001-1452 (print), 1533-385X (online)
DOI: 10.2514/1.J061079
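The abstract describes the idea only at a high level. As a purely illustrative sketch of cluster-based representative selection under a distribution-preservation goal, and not the published PREM algorithm (which additionally uses balanced clustering and a multistage computing strategy), one could select one weighted representative per k-means cluster. The function name and parameters below are hypothetical, and NumPy/scikit-learn are assumed available.

```python
# Illustrative sketch only -- NOT the authors' PREM algorithm.
# Picks one representative per k-means cluster and weights it by the
# cluster's share of the data, so the weighted subset roughly follows
# the original data distribution.
import numpy as np
from sklearn.cluster import KMeans

def representative_subset(X, k, random_state=0):
    """Return (reps, weights): k representative rows of X and their probability weights."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    reps = np.empty((k, X.shape[1]))
    weights = np.empty(k)
    for j in range(k):
        members = X[km.labels_ == j]
        # Representative = cluster member closest to the cluster centroid.
        d = np.linalg.norm(members - km.cluster_centers_[j], axis=1)
        reps[j] = members[np.argmin(d)]
        # Weight = fraction of all samples falling in this cluster.
        weights[j] = len(members) / len(X)
    return reps, weights

# Example: reduce 100,000 samples from a bimodal mixture to 20 weighted points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50_000, 2)),
               rng.normal(5.0, 0.5, (50_000, 2))])
reps, w = representative_subset(X, k=20)
print(reps.shape, w.sum())  # (20, 2) 1.0
```

With plain k-means, dense regions can dominate the cluster assignment, which is the undersampling/oversampling issue the paper's balanced-clustering step is designed to prevent; this sketch only conveys the general representative-mining idea.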