A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement
Published in: PLoS ONE, 2024-05, Vol. 19(5), e0299583
Main Authors:
Format: Article
Language: English
Summary: The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation (CV) of the KEGG-SMILES dataset. Therefore, the k-fold CV performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold CV performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
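The inflation described in the summary arises because chemically identical entries can land on both the training and testing sides of a cross-validation split. The sketch below is a minimal illustration, not the authors' published code, of how such duplicates can be detected and guarded against; the file name kegg_smiles_dataset.csv and the column name smiles are hypothetical, and it assumes RDKit, pandas, and scikit-learn are available.

```python
# Minimal sketch (assumed inputs, not the authors' code): detect duplicate
# SMILES entries and keep them from straddling k-fold CV splits.
from typing import Optional

import pandas as pd
from rdkit import Chem
from sklearn.model_selection import GroupKFold


def canonical_smiles(smiles: str) -> Optional[str]:
    """Canonicalize a SMILES string so chemically identical entries compare equal."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Hypothetical CSV holding SMILES strings and their pathway labels.
df = pd.read_csv("kegg_smiles_dataset.csv")
df["canonical"] = df["smiles"].map(canonical_smiles)
df = df.dropna(subset=["canonical"])  # drop unparsable entries

# Report the fraction of duplicate molecules (the paper reports ~26% for KEGG-SMILES).
dup_fraction = df.duplicated(subset="canonical").mean()
print(f"Duplicate entries: {dup_fraction:.1%}")

# Remedy 1: de-duplicate before any train/test splitting.
deduplicated = df.drop_duplicates(subset="canonical").reset_index(drop=True)

# Remedy 2: keep all rows but group folds by canonical SMILES, so the same
# molecule never appears in both the training and testing sets of a split.
gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(df, groups=df["canonical"]):
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    # ... fit the pathway-prediction model on train_df and evaluate on test_df ...
```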
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0299583