Quantitative evaluation of internal cluster validation indices using binary data sets

Bibliographic Details
Published in:	Journal of vegetation science 2024-09, Vol.35 (5), p.n/a
Main Authors:	Pakgohar, Naghmeh; Lengyel, Attila; Botta‐Dukát, Zoltán
Format:	Article
Language:	English
Subjects:	Algorithms; Binary data; Cluster analysis; cluster validation; Clustering; Datasets; geometric indices; internal indices; Noise levels; non‐geometric indices

Description
Summary:	Aims: Different clustering methods often classify the same data set differently. Cluster validation indices (CVIs) make it possible to select the “best” clustering solution from the alternatives. Because of the large variety of CVIs, choosing the index best suited to a given data set and clustering algorithm is challenging. We aim to assess different internal clustering validation indices. Methods: Artificial binary data sets with equal‐ and unequal‐sized, well‐separated a priori clusters were simulated, and three levels of noise were then added. Twenty replications of each of the six types of data sets (two group sizes × three levels of noise) were created and analyzed by three clustering algorithms with Jaccard dissimilarity. Twenty‐seven clustering validation indices were evaluated, including both geometric and non‐geometric indices. Results: Although, in theory, all CVIs could differentiate between good and wrong classifications, only a few performed as expected with noisy data. Tau and silhouette widths proved to be the best geometric CVIs for both equal and unequal cluster sizes. Among non‐geometric indices, Crispness and OptimClass performed best. Conclusion: We recommend using these best‐performing CVIs. We suggest plotting the CVI value against the number of clusters, because the lack of a sharp peak means that the position of the maximum is uncertain.
ISSN:	1100-9233; 1654-1103
DOI:	10.1111/jvs.13310
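
As a rough illustration of the kind of evaluation the summary describes, the sketch below simulates a small binary data set with a priori clusters, adds noise, computes Jaccard dissimilarities, and compares the mean silhouette width of the true partition with that of a deliberately wrong one. It is not the authors' code: the function names (`simulate`, `jaccard_dissimilarity`, `mean_silhouette`), the block-structured simulation design, the noise model (independent bit flips), and all parameter values are assumptions made for this example. The three clustering algorithms and the other 26 indices from the study are not reproduced.

```python
import random


def jaccard_dissimilarity(a, b):
    """Jaccard dissimilarity of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def simulate(n_clusters=3, per_cluster=10, block=8, noise=0.1, seed=1):
    """Hypothetical simulator: each cluster owns one block of variables,
    then every cell is flipped independently with probability `noise`."""
    rng = random.Random(seed)
    n_vars = n_clusters * block
    rows, labels = [], []
    for c in range(n_clusters):
        for _ in range(per_cluster):
            row = [1 if c * block <= j < (c + 1) * block else 0
                   for j in range(n_vars)]
            row = [1 - v if rng.random() < noise else v for v in row]
            rows.append(row)
            labels.append(c)
    return rows, labels


def distance_matrix(rows):
    """Full symmetric matrix of pairwise Jaccard dissimilarities."""
    return [[jaccard_dissimilarity(r, s) for s in rows] for r in rows]


def mean_silhouette(dist, labels):
    """Mean silhouette width of a partition, given a distance matrix."""
    n = len(labels)
    groups = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n)
               if j != i and labels[j] == labels[i]]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == g)
            / sum(1 for j in range(n) if labels[j] == g)
            for g in groups if g != labels[i]
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(widths) / n


rows, truth = simulate(noise=0.1)
dist = distance_matrix(rows)
# A deliberately wrong partition that mixes the a priori clusters.
wrong = [i % 3 for i in range(len(rows))]
print("true partition :", round(mean_silhouette(dist, truth), 3))
print("wrong partition:", round(mean_silhouette(dist, wrong), 3))
```

To follow the summary's suggestion of plotting the CVI value against the number of clusters, one would compute `mean_silhouette` for the partitions a clustering algorithm returns at each candidate number of clusters and inspect whether the curve has a sharp peak.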