Probabilistic REpresentatives Mining (PREM): A Clustering Method for Distributional Data Reduction

Bibliographic Details
Published in: AIAA Journal, 2022-04, Vol. 60 (4), p. 2580-2596
Main Authors: Gao, Zhenyu; Puranik, Tejas G.; Mavris, Dimitri N.
Format: Article
Language: English
Description
Summary: Complex computations and analyses on massive data sets can be impractical or infeasible. Data reduction, which produces a reduced representation of a data set to enable more efficient yet accurate analyses, is therefore a crucial problem in the era of big data. To preserve the integrity of the original data set, a reduced representation should maintain the same data distribution; such a representation is referred to as a probabilistically representative subset. This paper considers the problem of reducing a large data set to subsets so small that random sampling does not perform well enough. We propose a data mining approach called Probabilistic Representatives Mining (PREM) to tackle this challenge. PREM uses balanced clustering to prevent undersampling and oversampling, and a multistage computing strategy to achieve better scalability and consistency. Numerical experiments on typical probability distributions and real-world data sets from aeronautics and astronautics demonstrate PREM’s superiority over the baselines. An uncertainty quantification case study from aviation environmental impact modeling further shows PREM’s effectiveness and accuracy in generating small, probabilistically representative samples for costly computations. Potential limitations and extensions of the method are also discussed in the paper.
ISSN: 0001-1452, 1533-385X
DOI: 10.2514/1.J061079
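
Illustration (not from the paper): the summary states that a very small subset should preserve the data distribution and that balanced clustering helps avoid undersampling and oversampling. The sketch below is a hypothetical one-dimensional simplification of that idea, not the PREM algorithm itself. It assumes equal-size groups formed by sorting, uses each group's median as its representative, and measures distributional fit with a Kolmogorov-Smirnov-style distance. The names `balanced_representatives` and `ks_distance` are invented for this example.

```python
import random
import statistics
from bisect import bisect_right


def balanced_representatives(data, k):
    """Split sorted 1-D data into k near-equal-size groups; return each group's median.

    Assumption: a 1-D stand-in for balanced clustering, not the PREM procedure.
    """
    ordered = sorted(data)
    n = len(ordered)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(data)")
    reps = []
    for i in range(k):
        # Integer boundaries keep group sizes within one point of each other.
        lo, hi = i * n // k, (i + 1) * n // k
        reps.append(statistics.median(ordered[lo:hi]))
    return reps


def ks_distance(sample, reference):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    s, r = sorted(sample), sorted(reference)
    # Both CDFs are right-continuous step functions, so the largest gap
    # is reached at one of the observed points.
    return max(
        abs(bisect_right(s, x) / len(s) - bisect_right(r, x) / len(r))
        for x in s + r
    )


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    k = 10
    balanced = balanced_representatives(data, k)
    uniform = rng.sample(data, k)
    print("balanced groups:", ks_distance(balanced, data))
    print("random sample:  ", ks_distance(uniform, data))
```

Running the script prints the distance for both subsets, so the balanced selection can be compared against plain random sampling at the same subset size. Extending the idea to multivariate data requires a genuine balanced clustering step, which this sketch does not attempt.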