Improving clustering by learning a bi-stochastic data similarity matrix

Bibliographic Details
Published in: Knowledge and Information Systems, 2012-08, Vol. 32 (2), p. 351-382
Main Authors: Wang, Fei; Li, Ping; König, Arnd Christian; Wan, Muting
Format: Article
Language: English
Description
Summary: An idealized clustering algorithm seeks to learn a cluster-adjacency matrix such that, if two data points belong to the same cluster, the corresponding entry would be 1; otherwise, the entry would be 0. This integer (1/0) constraint makes it difficult to find the optimal solution. We propose a relaxation on the cluster-adjacency matrix, by deriving a bi-stochastic matrix from a data similarity (e.g., kernel) matrix according to the Bregman divergence. Our general method is named the Bregmanian Bi-Stochastication (BBS) algorithm. We focus on two popular choices of the Bregman divergence: the Euclidean distance and the Kullback–Leibler (KL) divergence. Interestingly, the BBS algorithm using the KL divergence is equivalent to the Sinkhorn–Knopp (SK) algorithm for deriving bi-stochastic matrices. We show that the BBS algorithm using the Euclidean distance is closely related to the relaxed k-means clustering and can often produce noticeably superior clustering results to the SK algorithm (and other algorithms such as Normalized Cut), through extensive experiments on public data sets.
ISSN: 0219-1377
EISSN: 0219-3116
DOI: 10.1007/s10115-011-0433-1
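
The summary states that BBS under the KL divergence is equivalent to the Sinkhorn–Knopp (SK) algorithm for deriving bi-stochastic matrices. As a minimal illustration of that special case only (this sketch is not taken from the article; the function name, parameters, and toy data below are assumptions), the following Python snippet alternately normalizes the rows and columns of a strictly positive similarity matrix until it is numerically bi-stochastic:

import numpy as np

def sinkhorn_knopp(K, n_iter=1000, tol=1e-9):
    # Alternate row/column normalization of a nonnegative matrix K.
    # For a strictly positive similarity (kernel) matrix this converges
    # to a bi-stochastic matrix (all row and column sums equal to 1).
    P = np.asarray(K, dtype=float).copy()
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # rows now sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns now sum to 1
        # after the column step, only the row sums can still deviate
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P

# Toy usage: an RBF (Gaussian) kernel over a few random 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)                   # strictly positive, symmetric
P = sinkhorn_knopp(K)
print(np.allclose(P.sum(axis=0), 1.0),  # True
      np.allclose(P.sum(axis=1), 1.0))  # True (up to tol)

The article's contribution is the general Bregman-divergence view, including the Euclidean case that it relates to relaxed k-means; the snippet above covers only the KL/SK special case mentioned in the summary.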