Towards U-statistics clustering inference for multiple groups
Published in: | Journal of statistical computation and simulation 2024-01, Vol.94 (1), p.204-222 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention due to their role in identifying patterns in high-dimensional data, with applications in many scientific fields. Towards developing a general framework for clustering in multiple groups, we present here a U-statistics-based approach, specially tailored for high-dimensional datasets, that clusters the data into three groups while assessing the significance of such partitions. We also consider theoretical aspects of allowing for an outlier group. Our approach builds on the U-statistics-based clustering framework of the methods in the R package uclust and inherits its properties: it is a non-parametric method that relies on very few assumptions about the data, and thus can be applied to a wide range of datasets. Furthermore, our method aims to be a statistically powerful tool for finding the best partitions of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and is shown to have statistical power comparable to or greater than that of competing alternatives in all scenarios considered. An application to image recognition data showcases our method. |
---|---|
ISSN: | 0094-9655 1563-5163 |
DOI: | 10.1080/00949655.2023.2239978 |