One-Pass Additive-Error Subset Selection for ℓp Subspace Approximation and (k, p)-Clustering

Bibliographic Details
Published in: Algorithmica 2023-10, Vol. 85(10), pp. 3144–3167
Main Authors: Deshpande, Amit; Pratap, Rameshwar
Format: Article
Language:English
Description
Summary: We consider the problem of subset selection for ℓ_p subspace approximation and (k, p)-clustering. Our aim is to efficiently find a small subset of data points such that solving the problem optimally on this subset gives a good approximation to solving it optimally on the original input. For ℓ_p subspace approximation, the previously known subset selection algorithms based on volume sampling and adaptive sampling, proposed in Deshpande and Varadarajan (STOC'07, 2007) for the general case of p ∈ [1, ∞), require multiple passes over the data. In this paper, we give a one-pass subset selection with an additive approximation guarantee for ℓ_p subspace approximation, for any p ∈ [1, ∞). Earlier subset selection algorithms that give a one-pass multiplicative (1 + ε) approximation work only in special cases: Cohen et al. (SODA'17, 2017) give a one-pass subset selection with a multiplicative (1 + ε) approximation guarantee for the special case of ℓ_2 subspace approximation, and Mahabadi et al. (STOC'20, 2020) give a one-pass noisy subset selection with a (1 + ε) approximation guarantee for ℓ_p subspace approximation when p ∈ {1, 2}. Our subset selection algorithm gives a weaker, additive approximation guarantee, but it works for any p ∈ [1, ∞). We also consider (k, p)-clustering, where the task is to group the data points into k clusters such that the sum of distances from points to their cluster centers, raised to the power p, is minimized, for p ∈ [1, ∞). The known subset selection algorithms for this problem are based on D^p sampling due to Wei (NIPS'16, 2016), which extends the D^2 sampling proposed in Arthur and Vassilvitskii (SODA'07, 2007). Due to the inherently adaptive nature of D^p sampling, these algorithms require multiple passes over the input. In this work, we give a one-pass subset selection for (k, p)-clustering whose cost is within a constant factor of the optimal solution, up to an additive error. Bachem et al. (NIPS'16, 2016) also give a one-pass subset selection for k-means, i.e., the case p = 2; our result handles the more general problem for any p ∈ [1, ∞). At its core, our contribution is a one-pass MCMC-based subset selection algorithm whose cost on the sampled points closely approximates the corresponding optimal cost, with high probability.
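For reference, the two objectives discussed in the abstract can be written out as follows, using the standard formulations over an input X = {x_1, …, x_n} ⊂ R^d (note that some papers state the subspace objective with a 1/p-th root; this rendering is ours, not taken from the paper):

```latex
% l_p subspace approximation: find the k-dimensional linear subspace V
% minimizing the sum of p-th powers of point-to-subspace distances
% (some papers instead minimize the 1/p-th root of this sum):
\min_{\substack{V \subseteq \mathbb{R}^d \\ \dim V = k}} \;
  \sum_{i=1}^{n} \operatorname{dist}(x_i, V)^p

% (k, p)-clustering: find k centers C minimizing the sum of p-th powers
% of point-to-nearest-center distances (p = 2 recovers k-means):
\min_{\substack{C \subseteq \mathbb{R}^d \\ |C| = k}} \;
  \sum_{i=1}^{n} \min_{c \in C} \lVert x_i - c \rVert^p
```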
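To see why D^p sampling is inherently multi-pass, here is a rough illustrative sketch of the adaptive scheme (not the algorithm from the paper; the function name and parameters are our own). Each new point is drawn with probability proportional to its p-th power distance from the points already chosen, so every pick forces a fresh scan of the data:

```python
import numpy as np

def dp_sample(X, k, p, seed=None):
    """Adaptive D^p sampling: pick k points from X, each with probability
    proportional to its distance (raised to the power p) from the points
    already chosen. With p = 2 this is exactly k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    chosen = [rng.integers(n)]                       # first point: uniform
    d = np.linalg.norm(X - X[chosen[0]], axis=1) ** p
    for _ in range(k - 1):
        i = rng.choice(n, p=d / d.sum())             # sample ∝ dist^p
        chosen.append(i)
        # Each round re-scans all of X to update nearest-point distances:
        # this is why D^p sampling needs multiple passes in a stream.
        d = np.minimum(d, np.linalg.norm(X - X[i], axis=1) ** p)
    return X[chosen]
```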
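The one-pass results replace this adaptive re-scanning with an MCMC sampler. Purely as an illustration of the idea, the sketch below runs a Metropolis-Hastings chain in the spirit of the k-MC² line of work by Bachem et al.; it is not the authors' algorithm, and the chain length m and all names are hypothetical:

```python
import numpy as np

def mcmc_dp_sample(X, k, p, m=200, seed=None):
    """Approximate D^p sampling with a Metropolis-Hastings chain: instead
    of recomputing all n distances after every pick, walk a short chain
    of length m whose stationary distribution approximates the D^p
    distribution over points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                   # first center: uniform

    def cost(x):
        # p-th power of the distance from x to its nearest current center.
        return min(np.linalg.norm(x - c) ** p for c in centers)

    for _ in range(k - 1):
        cur = X[rng.integers(n)]
        cur_cost = cost(cur)
        for _ in range(m):
            cand = X[rng.integers(n)]
            cand_cost = cost(cand)
            # Accept with probability min(1, cand_cost / cur_cost);
            # the max() guards against a zero-cost current state.
            if rng.random() * max(cur_cost, 1e-12) < cand_cost:
                cur, cur_cost = cand, cand_cost
        centers.append(cur)
    return np.array(centers)
```

The point of the chain is that each step evaluates distances only for the single candidate drawn, rather than re-scanning the whole dataset after every pick, which is what makes a one-pass, streaming implementation plausible.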
ISSN: 0178-4617
eISSN: 1432-0541
DOI: 10.1007/s00453-023-01124-0