Stable variable selection of class-imbalanced data with precision-recall criterion

Bibliographic Details
Published in: Chemometrics and intelligent laboratory systems, 2017-12, Vol. 171, p. 241-250
Main Authors: Fu, Guang-Hui; Xu, Feng; Zhang, Bing-Yang; Yi, Lun-Zhao
Format: Article
Language: English
Description
Summary: Screening important variables in class-imbalanced data remains a challenging task. In this study, we propose an algorithm for stably selecting key variables in class-imbalanced data based on the precision-recall curve (PRC). The PRC is used as the assessment criterion in the model-building stage, and sparse regularized logistic regression combined with subsampling (SRLRS) is designed to perform stable variable selection. Considering the characteristics of class-imbalanced data, we also propose a classification-based partition for cross-validation, as well as a subsampling scheme that leaves half of the majority observations out and one minority observation out (LHO-LOO). Results on simulated and real data show that the algorithm is highly suitable for handling class-imbalanced data, and that the PRC can serve as an alternative evaluation criterion for model selection on such data.
• The precision-recall curve (PRC) is used as a criterion for variable selection on class-imbalanced data.
• A novel algorithm (SRLRS) is proposed for dealing with class-imbalanced data.
• A novel subsampling strategy (LHO-LOO) for class-imbalanced data is designed for stable variable selection.
• Sparse regularized methods are successfully used for class-imbalanced data.
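The ideas named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the details of SRLRS or LHO-LOO, so everything below is an assumption made for illustration. Specifically, it assumes binary labels with 1 marking the minority class, random selection of the majority observations to leave out, a step-wise average precision as a one-number summary of the PRC, and a hypothetical `select` callback standing in for a sparse regularized logistic regression whose penalty would be tuned by that PRC summary.

```python
import random
from collections import Counter


def lho_loo_subsamples(y, n_rounds, seed=0):
    """Yield index lists that drop half of the majority class and one
    minority observation (assumption: label 1 marks the minority class)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label != 1]
    for _ in range(n_rounds):
        # Keep the half of the majority observations that is not left out.
        kept_majority = rng.sample(majority, len(majority) - len(majority) // 2)
        # Leave exactly one minority observation out.
        left_out = rng.choice(minority)
        kept_minority = [i for i in minority if i != left_out]
        yield sorted(kept_majority + kept_minority)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.
    Tied scores are ranked in input order, with no special handling."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for label in y_true if label == 1)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos if n_pos else 0.0


def selection_frequency(X, y, select, n_rounds=50, seed=0):
    """Fraction of subsamples in which each variable index is selected.
    `select(X_sub, y_sub)` is a hypothetical callback that returns the
    indices of the variables chosen on one subsample."""
    counts = Counter()
    for idx in lho_loo_subsamples(y, n_rounds, seed):
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        counts.update(set(select(X_sub, y_sub)))
    return {j: c / n_rounds for j, c in counts.items()}
```

Variables with a selection frequency close to 1 across subsamples would be the stable candidates. In a fuller version, `select` would fit the sparse model on each subsample and choose its penalty by maximizing `average_precision` on held-out data.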
ISSN: 0169-7439, 1873-3239
DOI: 10.1016/j.chemolab.2017.10.015