An efficient approximation to the K-means clustering for massive data

Bibliographic Details
Published in: Knowledge-Based Systems, 2017-02, Vol. 117, p. 56-69
Main Authors: Capó, Marco; Pérez, Aritz; Lozano, Jose A.
Format: Article
Language: English
Description
Summary:
• An approximation to the K-means algorithm for massive data problems is proposed.
• RPKM reduces the number of computations by several orders of magnitude while obtaining good approximations.
• RPKM reduces the maximum number of Lloyd’s iterations up to a Stirling number order.
• Experimentally, a monotone descent of the error function is consistently observed.

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial settings and the large number of distance computations that it can require to converge, the K-means algorithm remains one of the most popular clustering methods for massive datasets. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and weight (cardinality). A weighted version of the K-means algorithm is then applied over this local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, experimental results indicate that our method outperforms well-known approaches, such as K-means++ and minibatch K-means, in terms of the trade-off between the number of distance computations and the quality of the approximation.
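
The partition-then-cluster idea described in the summary can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the uniform grid that doubles its resolution at each level, the random seeding, the stopping rule, and the function names `grid_representatives`, `weighted_lloyd`, and `rpkm_sketch` are all assumptions made for the example. The summary itself only states that the data are recursively partitioned, that each subset is summarized by its center of mass and cardinality, and that a weighted K-means is run over those summaries.

```python
import numpy as np


def grid_representatives(X, level):
    """Summarize X by the occupied cells of a grid with 2**level bins per axis.

    Returns (reps, weights): the center of mass and the cardinality of each
    occupied cell. The grid is an illustrative assumption, not from the source.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)            # avoid division by zero
    bins = 2 ** level
    idx = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    _, cell, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    cell = cell.ravel()                               # one cell id per point
    sums = np.zeros((counts.size, X.shape[1]))
    np.add.at(sums, cell, X)                          # per-cell coordinate sums
    return sums / counts[:, None], counts.astype(float)


def weighted_lloyd(reps, weights, centers, n_iter=100):
    """Lloyd's iterations over weighted representatives instead of raw points."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared distances from every representative to every center.
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(centers.shape[0]):
            mask = labels == k
            if mask.any():                            # keep empty clusters in place
                new_centers[k] = np.average(
                    reps[mask], axis=0, weights=weights[mask]
                )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def rpkm_sketch(X, k, max_level=4, seed=0):
    """Refine the partition level by level, warm-starting from the last centers."""
    rng = np.random.default_rng(seed)
    centers = None
    for level in range(1, max_level + 1):
        reps, weights = grid_representatives(X, level)
        if reps.shape[0] < k:
            continue                                  # too few cells to seed k centers
        if centers is None:
            centers = reps[rng.choice(reps.shape[0], size=k, replace=False)]
        centers = weighted_lloyd(reps, weights, centers)
    if centers is None:
        raise ValueError("partition never produced at least k representatives")
    return centers
```

Calling `rpkm_sketch(X, k)` on an `(n, d)` array returns a `(k, d)` array of centers. Each Lloyd iteration here costs distances proportional to the number of occupied cells rather than to `n`, which is the source of the savings the summary refers to; how the partition is actually built and refined in the paper is not specified in this record.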
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2016.06.031