
Towards U-statistics clustering inference for multiple groups

Bibliographic Details
Published in: Journal of statistical computation and simulation, 2024-01, Vol. 94 (1), p. 204-222
Main Authors: Bello, Debora Zava; Valk, Marcio; Cybis, Gabriela Bettella
Format: Article
Language: English
Description
Summary: Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention due to their role in identifying patterns in high-dimensional data, with applications in many scientific fields. Towards developing a general framework for clustering in multiple groups, we present here a U-statistics-based approach, specially tailored for high-dimensional datasets, that clusters the data into three groups while assessing the significance of such partitions. We also consider theoretical aspects of allowing for an outlier group. Our approach builds on the U-statistics-based clustering framework of the methods in the R package uclust and inherits its properties, being a non-parametric method that relies on very few assumptions about the data. Thus, it can be applied to a wide range of datasets. Furthermore, our method aims to be a statistically powerful tool for finding the best partitions of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and is shown to have statistical power comparable to or greater than that of competing alternatives in all scenarios considered. An application to image recognition data showcases our method.
ISSN: 0094-9655, 1563-5163
DOI: 10.1080/00949655.2023.2239978