Loading…

Projected outlier detection in high-dimensional mixed-attributes data set

Detecting outlier efficiently is an active research issue in data mining, which has important applications in the field of fraud detection, network intrusion detection, monitoring criminal activities in electronic commerce, etc. Because of the sparsity of high dimensional data, it is reasonable and...

Full description

Saved in:
Bibliographic Details
Published in:Expert systems with applications 2009-04, Vol.36 (3), p.7104-7113
Main Authors: Ye, Mao, Li, Xue, Orlowska, Maria E.
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Detecting outlier efficiently is an active research issue in data mining, which has important applications in the field of fraud detection, network intrusion detection, monitoring criminal activities in electronic commerce, etc. Because of the sparsity of high dimensional data, it is reasonable and meaningful to detect the outliers in suitable projected subspaces. We call such subspace and outliers in the subspace as anomaly subspace and projected outlier respectively. Many efficient algorithms have already been proposed for outlier detection based on different approaches, but there are few literatures on projected outlier detection for high dimensional data sets with mixed continuous and categorical attributes. In this paper, a novel projected outlier detection algorithm is proposed to detect projected outliers in high-dimensional mixed attribute data set. Our main contributions are: (1) combined with information entropy, a novel measure of anomaly subspace is proposed. In this anomaly subspace, meaningful outliers could be detected and explained. Unlike the previous projected outlier detection methods, the dimension of anomaly subspace is not decided beforehand; (2) theoretical analysis about this measure is presented; (3) bottom-up method is proposed to find the interesting anomaly subspaces; (4) the outlying degree of projected outlier is defined, which has good explanations; (5) the data set with mixed data type is handled; (6) experiments on synthetic and real data sets to evaluate the effectiveness of our approach are performed.
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2008.08.030