
Clinical data classification using an enhanced SMOTE and chaotic evolutionary feature selection

Bibliographic Details
Published in: Computers in Biology and Medicine, 2020-11, Vol. 126, p. 103991-103991, Article 103991
Main Authors: Sreejith, S., Khanna Nehemiah, H., Kannan, A.
Format: Article
Language: English
Description
Summary: Class imbalance and the presence of irrelevant or redundant features in training data can pose serious challenges to the development of a classification framework. This paper proposes a framework for developing a Clinical Decision Support System (CDSS) that addresses both the class imbalance and the feature selection problems. Under this framework, the dataset is balanced at the data level and a wrapper approach is used to perform feature selection. Three clinical datasets from the University of California Irvine (UCI) machine learning repository were used for experimentation: the Indian Liver Patient Dataset (ILPD), the Thoracic Surgery Dataset (TSD) and the Pima Indian Diabetes (PID) dataset. The Synthetic Minority Over-sampling Technique (SMOTE), enhanced using Orchard's algorithm, was used to balance the datasets. A wrapper approach that uses Chaotic Multi-Verse Optimisation (CMVO) was proposed for feature subset selection. The fitness function was the arithmetic mean of the Matthews correlation coefficient (MCC) and the F-score (F1), both measured using a Random Forest (RF) classifier. After the relevant features were selected, an RF comprising 100 estimators and using the Information Gain Ratio as the split criterion was used for classification. The classifier achieved an MCC of 0.65, an F1 of 0.84 and 82.46% accuracy for the ILPD; an MCC of 0.74, an F1 of 0.87 and 86.88% accuracy for the TSD; and an MCC of 0.78, an F1 of 0.89 and 89.04% accuracy for the PID dataset. The effects of balancing and feature selection on the classifier were investigated, and the performance of the framework was compared with existing works in the literature. The results showed that the proposed framework is competitive in terms of the three performance measures used. The results of a Wilcoxon test confirmed the statistical superiority of the proposed method.
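The fitness function described in the summary, the arithmetic mean of MCC and F1, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`confusion_counts`, `mcc`, `f1`, `fitness`) are hypothetical, the predictions are taken as given rather than produced by a Random Forest trained on a candidate feature subset, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1(tp, fp, fn):
    """F-score of the positive class; 0.0 if undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def fitness(y_true, y_pred, positive=1):
    """Arithmetic mean of MCC and F1, as stated in the summary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return (mcc(tp, fp, tn, fn) + f1(tp, fp, fn)) / 2


if __name__ == "__main__":
    # Hypothetical labels and predictions, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    print(fitness(y_true, y_pred))
```

In a wrapper setting, a score like this would be computed once per candidate feature subset, and the optimiser would prefer subsets with higher values. Averaging the two measures rewards subsets that do well on both the positive class (F1) and the full confusion matrix (MCC).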
•A Clinical Decision Support System framework that addresses both the class imbalance and feature selection problems is proposed.
•The Synthetic Minority Oversampling Technique, enhanced using Orchard's algorithm, is used for re-balancing the datasets.
•Feature selection is performed by optimizing a novel fitness function modelled after two performance measures: the F-score and the Matthews Correlation Coefficient.
•A Chaotic Multi-Verse Optimisation Algorithm and a Gradient descent Feed Forward Artificial Neural Network are used for optimizing the fitness function.
•The proposed classifier can serve as a second opinion to a physician i
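The re-balancing step named in the highlights builds on SMOTE, which creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that basic interpolation idea. It uses a brute-force neighbour search in place of Orchard's algorithm, so it is not the enhanced variant the paper describes, and the function names and parameters (`smote`, `n_synthetic`, `k`, `seed`) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by basic SMOTE interpolation.

    Neighbours are found by brute force here (an assumption made for
    brevity); the paper uses Orchard's algorithm for this step.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical two-feature minority class, for illustration only.
    minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    for point in smote(minority, n_synthetic=4, k=2):
        print(point)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new samples stay inside the region the minority class already occupies.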
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2020.103991