A semiparametric method for clustering mixed data

Bibliographic Details
Published in: Machine learning 2016-12, Vol. 105 (3), p. 419-458
Main Authors: Foss, Alex; Markatou, Marianthi; Ray, Bonnie; Heching, Aliza
Format: Article
Language: English
Description
Summary: Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data are generally unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions. We develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses this fundamental problem directly. We study theoretical aspects of our method and demonstrate its effectiveness in a series of Monte Carlo simulation studies and a set of real-world applications.
ISSN: 0885-6125; 1573-0565
DOI: 10.1007/s10994-016-5575-7