Mixed fuzzy C-means clustering

Bibliographic Details
Published in: Information Sciences, 2025-02, Vol. 690, p. 121528, Article 121528
Main Author: Demirhan, Haydar
Format: Article
Language: English
Description
Summary: Clustering analysis becomes challenging when the dataset has mixed data types comprising categorical (nominal or ordinal scale) and numerical (interval scale) features. Mainstream distance metrics cannot handle the information that categorical data carry about the similarity between observations and cluster centers, which leads to performance loss. Various methods have been introduced in the literature to handle mixed data types in clustering. However, each method has disadvantages in capturing categorical information about similarity, in adjusting the contribution of categorical information to clustering, and in computational or implementation efficiency. This study proposes a mixed fuzzy C-means clustering method for mixed data types. Two new distance metrics are developed to handle binary and multi-class nominal features. The scaled entropy of each data type is used to adjust the weight of that data type in the overall similarity metric, which lowers bias because no user-specified weight is required. A comparative numerical study is conducted with twenty real datasets and seven benchmark methods using five cluster validation statistics. Mixed fuzzy C-means clustering performs better than the benchmark methods and is computationally efficient in practice. Since all the computer codes for implementing mixed fuzzy C-means clustering are given, the proposed method is readily applicable to practical problems.
ISSN: 0020-0255
DOI: 10.1016/j.ins.2024.121528
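
Illustration: the summary mentions two ingredients that can be sketched with textbook formulas, namely an entropy scaled to [0, 1] and the fuzzy C-means membership update. The Python sketch below is not the article's code and does not reproduce its two new distance metrics or its weighting scheme. The function names `scaled_entropy` and `fcm_memberships` are hypothetical, the scaling by log(k) is an assumed normalization, and the membership rule is the standard fuzzy C-means update with fuzzifier m.

```python
import math
from collections import Counter


def scaled_entropy(values):
    """Shannon entropy of a categorical feature divided by log(k).

    k is the number of distinct observed categories, so the result lies
    in [0, 1]. This normalization is an assumption made for illustration,
    not necessarily the scaling used in the article.
    """
    counts = Counter(values)
    k = len(counts)
    if k < 2:
        return 0.0  # a constant feature carries no information
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


def fcm_memberships(distances, m=2.0):
    """Standard fuzzy C-means memberships of one observation.

    distances holds the observation's distance to each cluster center;
    m > 1 is the fuzzifier. The returned memberships sum to 1.
    """
    # An observation sitting exactly on one or more centers belongs
    # entirely to those centers, shared equally.
    zeros = [d == 0 for d in distances]
    if any(zeros):
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


if __name__ == "__main__":
    print(scaled_entropy(["a", "b", "a", "b"]))
    print(fcm_memberships([1.0, 2.0]))
```

In a mixed-data setting, a quantity such as `scaled_entropy` could serve as a data-driven weight when distances from categorical and numerical features are combined into a single distance passed to `fcm_memberships`; how the article actually combines them is specified in its full text, not in this record.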