Quantitative evaluation of internal cluster validation indices using binary data sets
Published in: | Journal of vegetation science 2024-09, Vol.35 (5), p.n/a |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Aims
Different clustering methods often classify the same data set differently. Selecting the “best” clustering solution from alternatives is possible with cluster validation indices (CVIs). Because so many CVIs exist, choosing the index best suited to a given data set and clustering algorithm is challenging. We aim to assess different internal CVIs.
Methods
Artificial binary data sets with equal‐ and unequal‐sized well‐separated a priori clusters were simulated, and three levels of noise were then added. Twenty replications of each of the six types of data sets (two group sizes × three levels of noise) were created and analyzed by three clustering algorithms with Jaccard dissimilarity. Twenty‐seven CVIs, both geometric and non‐geometric, were evaluated.
Results
Although, in theory, all CVIs could differentiate between good and poor classifications, only a few performed as expected with noisy data. Tau and silhouette widths proved to be the best geometric CVIs for both equal and unequal cluster sizes. Among non‐geometric indices, crispness and OptimClass performed best.
Conclusion
We recommend using these best‐performing CVIs. We suggest plotting the CVI value against the number of clusters because the lack of a sharp peak means that the position of the maximum is uncertain.
Because of the large variety of cluster validation indices (CVIs), choosing the most suitable index is challenging. We assessed several CVIs using artificial binary data sets. Only a few CVIs performed as expected with noisy data. Tau and silhouette widths proved to be the best geometric CVIs both for equal and unequal cluster sizes. Among non‐geometric indices, crispness and OptimClass performed best. |
---|---|
ISSN: | 1100-9233 1654-1103 |
DOI: | 10.1111/jvs.13310 |
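
The kind of experiment described in the Methods summary can be illustrated with a minimal sketch. It is not the authors' code: the function names (`simulate`, `jaccard`, `mean_silhouette`), the matrix dimensions, and the noise level of 0.1 are all assumptions chosen for illustration, and no clustering algorithm is run. The sketch simulates well-separated binary clusters, flips cells at random as noise, computes Jaccard dissimilarity, and compares the mean silhouette width of the a priori partition with that of a randomly shuffled partition.

```python
import random


def simulate(n_clusters=3, per_cluster=10, n_vars=30, noise=0.1, seed=0):
    """Binary matrix with equal-sized clusters (all sizes are assumptions).

    Each cluster owns a disjoint block of variables set to 1; every cell
    is then flipped with probability `noise`.
    """
    rng = random.Random(seed)
    block = n_vars // n_clusters
    rows, labels = [], []
    for k in range(n_clusters):
        for _ in range(per_cluster):
            row = [1 if k * block <= j < (k + 1) * block else 0
                   for j in range(n_vars)]
            row = [1 - v if rng.random() < noise else v for v in row]
            rows.append(row)
            labels.append(k)
    return rows, labels


def jaccard(a, b):
    """Jaccard dissimilarity between two binary vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0


def mean_silhouette(dist, labels):
    """Average silhouette width for a partition, given a distance matrix."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        means = {}
        for c in set(labels):
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                means[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in means:  # singleton cluster contributes 0
            continue
        a = means[own]                                    # within-cluster mean
        b = min(v for c, v in means.items() if c != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


rows, labels = simulate(noise=0.1)
dist = [[jaccard(a, b) for b in rows] for a in rows]

# A deliberately wrong partition: the same labels in random order.
shuffled = labels[:]
random.Random(1).shuffle(shuffled)

print("a priori partition:", mean_silhouette(dist, labels))
print("shuffled partition:", mean_silhouette(dist, shuffled))
```

Repeating the last step over several values of `noise` and several seeds would give a rough picture of how quickly an index loses its ability to separate a good partition from a poor one.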