A novel missing value imputation relying on K-means clustering and kernel-based weighting using grey relation (KWGI)
Published in: | Journal of intelligent & fuzzy systems 2023-01, Vol.44 (4), p.5675-5697 |
---|---|
Format: | Article |
Language: | English |
Summary: | Data pre-processing is a crucial phase of data mining that improves the efficiency of data mining techniques, and one of its most important operations is the imputation of missing values in incomplete datasets. This research presents a new imputation technique that uses K-means clustering and a sample-weighting mechanism based on Grey relation (KWGI). A Grey-based K-means algorithm, applicable to all samples of an incomplete dataset, clusters similar samples; a kernel function then generates weights based on the Grey relation. Missing values of incomplete samples are estimated with a weighted mean to reduce the impact of outliers and vague samples. In both the clustering and imputation steps, a penalty mechanism reduces the similarity of ambiguous samples that have a high number of missing values and consequently increases the accuracy of clustering and imputation. KWGI was applied to nine natural datasets and compared with eight state-of-the-art and commonly used methods, namely CMIWD, KNNI, HotDeck, MeanI, KmeanI, RKmeanI, ICKmeanI, and FKMI. The imputation results are evaluated by the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) criteria. Missing values are generated at two levels, namely sample and value, and the results are discussed over a wide range of missingness rates, from low to high. The t-test results show that the proposed method performs significantly better than all the compared methods. |
---|---|
ISSN: | 1064-1246 1875-8967 |
DOI: | 10.3233/JIFS-200774 |
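
The weighting idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not give the kernel, the penalty mechanism, or the clustering details, so the code below assumes Deng's standard grey relational coefficient, a Gaussian kernel with a hypothetical `bandwidth` parameter, and a set of complete donor samples (for example, the members of one cluster). The K-means step and the penalty for heavily incomplete samples are omitted. The function names are illustrative only.

```python
import numpy as np


def grey_relational_grade(target, donors, rho=0.5):
    """Deng's grey relational grade between `target` and each donor row.

    Only attributes observed (non-NaN) in `target` are compared.
    `donors` is assumed to contain no missing values.
    """
    observed = ~np.isnan(target)
    # Absolute differences, shape (n_donors, n_observed_attributes)
    delta = np.abs(donors[:, observed] - target[observed])
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:
        # Every donor matches the target exactly on the observed attributes
        return np.ones(len(donors))
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)


def impute_sample(target, donors, rho=0.5, bandwidth=0.1):
    """Fill NaNs in `target` with a kernel-weighted mean of donor values.

    Assumption: a Gaussian kernel on (1 - grade) turns grey relational
    grades into weights; the paper's actual kernel may differ.
    """
    grade = grey_relational_grade(target, donors, rho)
    weights = np.exp(-((1.0 - grade) ** 2) / (2.0 * bandwidth ** 2))
    filled = target.copy()
    missing = np.isnan(target)
    filled[missing] = weights @ donors[:, missing] / weights.sum()
    return filled


def rmse(true, pred):
    """Root Mean Squared Error between true and imputed values."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))


def mae(true, pred):
    """Mean Absolute Error between true and imputed values."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return float(np.mean(np.abs(true - pred)))


# Two donors resemble the target; the third is an outlier.
donors = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 5.0],
                   [10.0, 20.0, 100.0]])
target = np.array([1.0, 2.0, np.nan])
filled = impute_sample(target, donors)
```

In this toy example the outlier donor receives a near-zero weight, so the imputed value stays close to the mean of the two similar donors, which is the effect the summary attributes to the weighted mean.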