A Machine Learning-Based Analytic Pipeline Applied to Clinical and Serum IgG Immunoproteome Data To Predict Chlamydia trachomatis Genital Tract Ascension and Incident Infection in Women

Bibliographic Details
Published in:	Microbiology spectrum 2023-08, Vol.11 (4), p.e0468922
Main Authors:	Liu, Chuwen; Mokashi, Neha Vivek; Darville, Toni; Sun, Xuejun; O'Connell, Catherine M; Hufnagel, Katrin; Waterboer, Tim; Zheng, Xiaojing
Format:	Article
Language:	English
Subjects:	ascension; Bayes Theorem; biomarker; Biomarkers; Chlamydia genital tract infection; Chlamydia Infections; Chlamydia trachomatis; Female; Genitalia; Humans; Immunoglobulin G; incident infection; Machine Learning; pipeline; Reproducibility of Results

Description
Summary:	We developed a reusable and open-source machine learning (ML) pipeline that can provide an analytical framework for rigorous biomarker discovery. We implemented the ML pipeline to determine the predictive potential of clinical and immunoproteome antibody data, collected from 222 cis-gender females with high exposure, for outcomes associated with Chlamydia trachomatis infection. We compared the predictive performance of 4 ML algorithms (naive Bayes, random forest, extreme gradient boosting with linear booster [xgbLinear], and k-nearest neighbors [KNN]), screened from 215 ML methods, in combination with two different feature selection strategies, Boruta and recursive feature elimination. Recursive feature elimination performed better than Boruta in this study. In prediction of ascending infection, naive Bayes yielded a slightly higher median area under the receiver operating characteristic curve (AUROC), 0.57 (95% confidence interval [CI], 0.54 to 0.59), than other methods and provided biological interpretability. For prediction of incident infection among women uninfected at enrollment, KNN performed slightly better than other algorithms, with a median AUROC of 0.61 (95% CI, 0.49 to 0.70). In contrast, for women infected at enrollment, xgbLinear and random forest had higher predictive performances, with median AUROCs of 0.63 (95% CI, 0.58 to 0.67) and 0.62 (95% CI, 0.58 to 0.64), respectively. Our findings suggest that clinical factors and serum anti-C. trachomatis protein IgGs are inadequate biomarkers for ascension or incident infection. Nevertheless, our analysis highlights the utility of a pipeline that searches for biomarkers and evaluates prediction performance and interpretability.
Biomarker discovery to aid early diagnosis and treatment using machine learning (ML) approaches is a rapidly developing area in host-microbe studies. However, lack of reproducibility and interpretability of ML-driven biomarker analysis hinders selection of robust biomarkers that can be applied in clinical practice. We thus developed a rigorous ML analytical framework and provide recommendations for enhancing reproducibility of biomarkers. We emphasize the importance of robustness in selection of ML methods, evaluation of performance, and interpretability of biomarkers. Our ML pipeline is reusable and open-source and can be used not only to identify host-pathogen interaction biomarkers but also in microbiome studies and ecological and environmental microbiology research.
ISSN:	2165-0497
DOI:	10.1128/spectrum.04689-22
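
Note on the AUROC figures in the summary: the AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so 0.5 corresponds to chance and the reported medians of 0.57 to 0.63 sit only slightly above it. The sketch below is illustrative only and is not the authors' pipeline. It computes the AUROC from that pairwise definition and, as an assumption (this record does not say how the median and interval were obtained), summarizes it with a percentile bootstrap. The function names and the resampling scheme are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def median_auroc_with_interval(labels, scores, n_resamples=1000, seed=0):
    """Median AUROC and a 95% percentile interval over bootstrap resamples.

    ASSUMPTION: bootstrap resampling is used here purely for illustration;
    the source record does not state the resampling scheme behind its
    reported medians and confidence intervals.
    """
    rng = random.Random(seed)
    indices = range(len(labels))
    values = []
    while len(values) < n_resamples:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # a resample with only one class has no defined AUROC
        values.append(auroc(ys, [scores[i] for i in sample]))
    values.sort()
    lower = values[int(0.025 * (len(values) - 1))]
    upper = values[int(0.975 * (len(values) - 1))]
    return statistics.median(values), lower, upper


if __name__ == "__main__":
    # Toy data, not taken from the study.
    y = [0, 0, 1, 1, 0, 1, 0, 1]
    s = [0.1, 0.4, 0.35, 0.8, 0.5, 0.45, 0.2, 0.6]
    print(auroc(y, s))
    print(median_auroc_with_interval(y, s, n_resamples=200))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a cohort of a few hundred participants; a rank-based formula would be the usual choice for larger data.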