A Multi-Criteria Approach for Fast and Robust Representative Selection from Manifolds
Published in: | IEEE Transactions on Knowledge and Data Engineering, 2022-07, Vol. 34 (7), p. 3057-3071 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | The problem of representative selection amounts to sampling a few informative exemplars from large datasets. Existing approaches to data selection often fall short of simultaneously handling non-linear data structures, sampling concise and non-redundant subsets, rejecting outliers, and yielding interpretable outcomes. This paper presents a novel representative selection approach, dubbed MOSAIC, for drawing descriptive sketches of arbitrary manifold structures. Resting upon a novel quadratic formulation, MOSAIC advances a multi-criteria selection approach that maximizes the global representation power of the sampled subset, ensures novelty of the samples by minimizing redundancy, and rejects disruptive information by effectively detecting outliers. Theoretical analyses shed light on the geometrical characterization of the obtained sketch and reveal that the sampled representatives maximize a well-defined notion of data coverage in a transformed space. In addition, we present a highly scalable randomized implementation of the proposed algorithm, which is shown to bring about substantial speedups. Extensive experiments on both real and synthetic data, with comparisons to state-of-the-art algorithms, demonstrate MOSAIC's superiority in achieving the desired characteristics of a representative subset all at once while exhibiting remarkable robustness to various outlier types. |
---|---|
ISSN: | 1041-4347, 1558-2191 |
DOI: | 10.1109/TKDE.2020.3024099 |