Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE)
Published in: | Knowledge-based systems 2023-02, Vol.262, p.110235, Article 110235 |
---|---|
Format: | Article |
Language: | English |
Summary: | Classification problems involving unbalanced data sets are commonplace in industrial production and medical research. Various approaches handle these problems by generating synthetic samples, but most of them rely on hyperparameters and tend to generate noise because they neglect the entropy of the initial data. Recently, clustering-based oversampling methods have been proposed to overcome this problem. Unfortunately, they inherit the sensitivity of hard clustering methods, and their hyperparameters are selected manually. This paper introduces Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE), which balances data with minimum noise by combining an original mathematical model, soft clustering, and evolutionary optimization. First, to handle the sensitivity of K-means, OEGFCM-SMOTE uses a SMOTE variant that generates samples in safe regions identified by Fuzzy-C-Means, which is known to cope with the boundary problem. Fuzzy-C-Means SMOTE proceeds in three steps (grouping, filtering, and interpolation) and uses four parameters: the number of clusters, the number of neighboring points of the minority data, the threshold of the imbalance ratio, and the exponent of the distribution of the minority data in the promising clusters. Second, these parameters are chosen through a mixed-variable optimization model that minimizes the amount of noise, measured by entropy; the feasible domain is estimated by considering the density of the data set and by studying the boundary cases. Finally, this model is solved with a genetic algorithm whose genetic operators are applied at appropriate rates. OEGFCM-SMOTE is evaluated using 5 classifiers and 21 unbalanced data sets (15 of ordinary size and 6 big-data sets), and it is compared with 14 oversampling methods using three performance measures. To address the multiple-comparison problem across data sets, Holm's test is used. OEGFCM-SMOTE consistently outperforms other popular oversampling methods. |
---|---|
ISSN: | 0950-7051 1872-7409 |
DOI: | 10.1016/j.knosys.2022.110235 |
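
The three steps named in the summary (grouping, filtering, interpolation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `fuzzy_c_means` and `fcm_smote` and the parameter names `irt` (imbalance-ratio threshold) and `de` (density exponent) are hypothetical. The filtering rule and the sparsity weighting are assumptions modeled on k-means SMOTE style oversamplers. Soft memberships are reduced to hard labels with `argmax` for simplicity. The entropy objective and the genetic search over the four parameters are left out.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means; returns (centers, memberships of shape (n, c))."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = np.fmax(d, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def fcm_smote(X, y, minority=1, n_clusters=3, k_neighbors=5,
              irt=1.0, de=2.0, seed=0):
    """Hypothetical sketch: oversample `minority` until classes are balanced."""
    rng = np.random.default_rng(seed)
    is_min = y == minority
    n_new = int((~is_min).sum() - is_min.sum())
    if n_new <= 0:
        return X, y

    # 1) Grouping: cluster all points, then take the strongest membership.
    _, U = fuzzy_c_means(X, n_clusters, seed=seed)
    labels = U.argmax(axis=1)

    # 2) Filtering (assumed rule): keep clusters whose imbalance ratio is
    #    below `irt`, and weight them by the sparsity of their minority points.
    kept, sparsity = [], []
    for c in range(n_clusters):
        pts = X[(labels == c) & is_min]
        n_min = len(pts)
        n_maj = int(((labels == c) & ~is_min).sum())
        if n_min < 2 or (n_maj + 1) / (n_min + 1) >= irt:
            continue
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        mean_dist = d.sum() / (n_min * (n_min - 1))
        sparsity.append(mean_dist ** de / n_min + 1e-12)
        kept.append((pts, d))
    if not kept:
        return X, y                              # no safe cluster found
    p = np.asarray(sparsity) / np.sum(sparsity)

    # 3) Interpolation: SMOTE step inside each selected cluster.
    synthetic = []
    for c in rng.choice(len(kept), size=n_new, p=p):
        pts, d = kept[c]
        k = min(k_neighbors, len(pts) - 1)
        i = rng.integers(len(pts))
        j = rng.choice(np.argsort(d[i])[1:k + 1])   # one of the k nearest
        synthetic.append(pts[i] + rng.random() * (pts[j] - pts[i]))
    X_out = np.vstack([X, np.vstack(synthetic)])
    y_out = np.concatenate([y, np.full(n_new, minority)])
    return X_out, y_out

# Toy data: 40 majority points near (0, 0), 10 minority points near (10, 10).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 0.3, (40, 2)),
                    rng.normal(10.0, 0.3, (10, 2))])
y_demo = np.array([0] * 40 + [1] * 10)
X_res, y_res = fcm_smote(X_demo, y_demo, n_clusters=2)
print(np.bincount(y_res))
```

In this sketch the four arguments `n_clusters`, `k_neighbors`, `irt`, and `de` stand in for the four parameters listed in the summary. In the paper's setting they would be chosen by the genetic algorithm rather than fixed by hand.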