MultiThal-classifier, a machine learning-based multi-class model for thalassemia diagnosis and classification

Bibliographic Details
Published in: Clinica chimica acta, 2025-02, Vol. 567, p. 120025, Article 120025
Main Authors: Wang, WenQiang; Ye, RenQing; Tang, BaoJia; Qi, YuYing
Format: Article
Language: English
Description
Summary:
• XGBoost-based M-THAL model differentiates Normocytic-TT, Microcytic-TT, IDA, and controls.
• Includes thalassemia trait with normal MCV, addressing screening method gaps.
• Uses routine blood test data; SMOTE optimizes for imbalanced data.
• SHAP values reveal MCV, MCH, RDW-SD as key features for interpretability.
• Robust performance validated through external dataset testing.

The differential diagnosis between iron deficiency anemia (IDA) and thalassemia trait (TT) remains a significant clinical challenge. This study aimed to develop a machine learning-based multi-class model to differentiate among Microcytic-TT (TT with low mean corpuscular volume), Normocytic-TT (TT with normal mean corpuscular volume), IDA, and healthy individuals. A dataset comprising 1,819 individuals was analyzed using six machine learning algorithms. The eXtreme Gradient Boosting (XGBoost) algorithm was ultimately selected to construct the MultiThal-Classifier (M-THAL) model. SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) was employed to address data imbalance. Model performance was evaluated using various metrics, and SHAP values were applied to interpret the model’s predictions. Additionally, external validation was conducted to assess the model’s robustness and generalizability.
After 1000 bootstrap resamples of the test set, the mean performance metrics of M-THAL, with 95% confidence intervals (CI), were as follows: sensitivity 90.27% (95% CI: 84.88–95.26), specificity 97.87% (95% CI: 97.10–98.55), PPV 93.42% (95% CI: 89.34–96.48), NPV 97.82% (95% CI: 97.00–98.53), F1-score 91.50% (95% CI: 87.29–95.34), Youden’s index 88.15% (95% CI: 82.33–93.70), accuracy 97.06% (95% CI: 96.06–97.99), and AUC 94.07% (95% CI: 91.17–96.84). Feature importance analysis identified mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), red cell distribution width-standard deviation (RDW-SD), and hemoglobin (HGB) as the most important features. External validation confirmed the model’s robustness and generalizability. M-THAL effectively distinguishes Normocytic-TT, Microcytic-TT, IDA, and healthy individuals using hematological parameters, offering a rapid and cost-effective screening tool that can be readily implemented in diverse healthcare settings.
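The bootstrap summary above can be illustrated with a minimal sketch. It is not the authors' code: the toy labels, the helper names `binary_metrics` and `bootstrap_ci`, the one-vs-rest framing for a single class, and the percentile method for the interval are all assumptions made for illustration. It also shows that Youden's index is sensitivity + specificity − 1, which is consistent with the reported values (90.27% + 97.87% − 100% ≈ 88.15%).

```python
import random

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and Youden's index for 0/1 labels (one class vs rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sens, "specificity": spec, "youden": sens + spec - 1}

def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Mean and percentile CI of each metric over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    draws = {"sensitivity": [], "specificity": [], "youden": []}
    for _ in range(n_resamples):
        # Resample test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        m = binary_metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name, value in m.items():
            draws[name].append(value)
    summary = {}
    for name, values in draws.items():
        values = sorted(v for v in values if v == v)  # drop NaN from degenerate resamples
        lo = values[int((alpha / 2) * (len(values) - 1))]
        hi = values[int((1 - alpha / 2) * (len(values) - 1))]
        summary[name] = (sum(values) / len(values), lo, hi)
    return summary

# Hypothetical toy test set: 40 positives (36 detected), 160 negatives (156 correct).
y_true = [1] * 40 + [0] * 160
y_pred = [1] * 36 + [0] * 4 + [0] * 156 + [1] * 4

point = binary_metrics(y_true, y_pred)
ci = bootstrap_ci(y_true, y_pred)
for name, (mean, lo, hi) in ci.items():
    print(f"{name}: mean={mean:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

Because each metric is averaged over resamples independently, the reported mean Youden's index need not equal the sum of the mean sensitivity and mean specificity minus one to the last digit.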
ISSN: 0009-8981
1873-3492
DOI: 10.1016/j.cca.2024.120025