Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites

Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address...

Bibliographic Details
Published in: Metabolites 2023-11, Vol. 13 (11), p. 1120
Main Authors: Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3
container_end_page
container_issue 11
container_start_page 1120
container_title Metabolites
container_volume 13
creator Huckvale, Erik D
Powell, Christian D
Jin, Huan
Moseley, Hunter N B
description Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on older KEGG-based metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
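For readers unfamiliar with the two metrics quoted in the description, the following is a minimal sketch of how an F1 score and a Matthews correlation coefficient (MCC) are computed from the confusion-matrix counts of one binary classifier, and how per-category scores might be combined into a weighted average. It is not the authors' code: the function names are hypothetical, and the weighting scheme (here, arbitrary per-category weights such as the number of entries in each pathway category) is an assumption, since the record does not say how the weights were chosen.

```python
import math
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-category scores.

    Assumption: the weights are supplied by the caller; the source does not
    state which weights the authors used.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(f1_score(tp=8, fp=2, fn=2))
    print(matthews_corrcoef(tp=8, tn=8, fp=2, fn=2))
    print(weighted_average([0.5, 1.0], [1, 3]))
```

Unlike F1, MCC also uses the true-negative count, which is one reason the two are often reported together for binary classifiers evaluated on imbalanced categories.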
doi_str_mv 10.3390/metabo13111120
format article
fullrecord <record><control><sourceid>gale_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A774323428</galeid><doaj_id>oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6</doaj_id><sourcerecordid>A774323428</sourcerecordid><originalsourceid>FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</originalsourceid><addsrcrecordid>eNpdUk1v1DAQjRCIVkuvHJElLly2eGzHsY-lfK20FT2UczRxJrtekrjY3qL-e8xuKR_jg62nN2_ek6eqXgI_l9LytxNl7AJIKCX4k-pUCDBLsMY-_et9Up2ltOOlNK8bDs-rE9lYawXo02r7jma3nTB-Y-8xY6LMhhDZTUQ_-3nDrtBt_UxsTRiPQOhpTCwHdh2p9y6zvCV2jXn7A-_Zar4L4x1NNGcWBnZ1MDj6TOlF9WzAMdHZw72ovn78cHP5ebn-8ml1ebFeOiVkXooBukFxq4yuiajn2HXCQe2cMZJ6oazTGqzEmssaei0B5QBacysQBkNyUa2Oun3AXXsbfcl23wb07QEIcdNizN6N1NZKdaoWjTOdVqrpDIfGNmSh57IB1EXrzVHrNobve0q5nXxyNI44U9inVhgrjTQWmkJ9_R91F_ZxLkkPLBCgiutFdX5kbbDM9_MQckRXTk-Td2GmwRf8ommUFFIJ86fBxZBSpOExEfD21w60_-5AaXj14GPfTdQ_0n__uPwJ8ROq7g</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2893121419</pqid></control><display><type>article</type><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><source>PubMed Central Free</source><source>Publicly Available Content Database</source><creator>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N B</creator><creatorcontrib>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N B</creatorcontrib><description>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. 
To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</description><identifier>ISSN: 2218-1989</identifier><identifier>EISSN: 2218-1989</identifier><identifier>DOI: 10.3390/metabo13111120</identifier><identifier>PMID: 37999216</identifier><language>eng</language><publisher>Switzerland: MDPI AG</publisher><subject>Analysis ; Classification ; Datasets ; Genomes ; Hydrogen ; Hypotheses ; Information management ; KEGG ; kegg_pull ; Learning algorithms ; Machine learning ; md_harmonize ; Metabolic pathways ; Metabolism ; metabolite ; Metabolites ; Metabolomics ; Molecular structure ; NMR ; Nuclear magnetic resonance ; pathway</subject><ispartof>Metabolites, 2023-11, Vol.13 (11), p.1120</ispartof><rights>COPYRIGHT 2023 MDPI AG</rights><rights>2023 by the authors. Licensee MDPI, Basel, Switzerland. 
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</cites><orcidid>0000-0003-3995-5368</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2893121419/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2893121419?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,25753,27924,27925,37012,37013,44590,75126</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37999216$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Huckvale, Erik D</creatorcontrib><creatorcontrib>Powell, Christian D</creatorcontrib><creatorcontrib>Jin, Huan</creatorcontrib><creatorcontrib>Moseley, Hunter N B</creatorcontrib><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><title>Metabolites</title><addtitle>Metabolites</addtitle><description>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. 
To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</description><subject>Analysis</subject><subject>Classification</subject><subject>Datasets</subject><subject>Genomes</subject><subject>Hydrogen</subject><subject>Hypotheses</subject><subject>Information management</subject><subject>KEGG</subject><subject>kegg_pull</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>md_harmonize</subject><subject>Metabolic pathways</subject><subject>Metabolism</subject><subject>metabolite</subject><subject>Metabolites</subject><subject>Metabolomics</subject><subject>Molecular structure</subject><subject>NMR</subject><subject>Nuclear magnetic 
resonance</subject><subject>pathway</subject><issn>2218-1989</issn><issn>2218-1989</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNpdUk1v1DAQjRCIVkuvHJElLly2eGzHsY-lfK20FT2UczRxJrtekrjY3qL-e8xuKR_jg62nN2_ek6eqXgI_l9LytxNl7AJIKCX4k-pUCDBLsMY-_et9Up2ltOOlNK8bDs-rE9lYawXo02r7jma3nTB-Y-8xY6LMhhDZTUQ_-3nDrtBt_UxsTRiPQOhpTCwHdh2p9y6zvCV2jXn7A-_Zar4L4x1NNGcWBnZ1MDj6TOlF9WzAMdHZw72ovn78cHP5ebn-8ml1ebFeOiVkXooBukFxq4yuiajn2HXCQe2cMZJ6oazTGqzEmssaei0B5QBacysQBkNyUa2Oun3AXXsbfcl23wb07QEIcdNizN6N1NZKdaoWjTOdVqrpDIfGNmSh57IB1EXrzVHrNobve0q5nXxyNI44U9inVhgrjTQWmkJ9_R91F_ZxLkkPLBCgiutFdX5kbbDM9_MQckRXTk-Td2GmwRf8ommUFFIJ86fBxZBSpOExEfD21w60_-5AaXj14GPfTdQ_0n__uPwJ8ROq7g</recordid><startdate>20231101</startdate><enddate>20231101</enddate><creator>Huckvale, Erik D</creator><creator>Powell, Christian D</creator><creator>Jin, Huan</creator><creator>Moseley, Hunter N B</creator><general>MDPI AG</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QR</scope><scope>8FD</scope><scope>8FE</scope><scope>8FH</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>LK8</scope><scope>M7P</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-3995-5368</orcidid></search><sort><creationdate>20231101</creationdate><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><author>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N 
B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Analysis</topic><topic>Classification</topic><topic>Datasets</topic><topic>Genomes</topic><topic>Hydrogen</topic><topic>Hypotheses</topic><topic>Information management</topic><topic>KEGG</topic><topic>kegg_pull</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>md_harmonize</topic><topic>Metabolic pathways</topic><topic>Metabolism</topic><topic>metabolite</topic><topic>Metabolites</topic><topic>Metabolomics</topic><topic>Molecular structure</topic><topic>NMR</topic><topic>Nuclear magnetic resonance</topic><topic>pathway</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Huckvale, Erik D</creatorcontrib><creatorcontrib>Powell, Christian D</creatorcontrib><creatorcontrib>Jin, Huan</creatorcontrib><creatorcontrib>Moseley, Hunter N B</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Chemoreception Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Biological Science 
Collection</collection><collection>Biological Science Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>Directory of Open Access Journals</collection><jtitle>Metabolites</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Huckvale, Erik D</au><au>Powell, Christian D</au><au>Jin, Huan</au><au>Moseley, Hunter N B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</atitle><jtitle>Metabolites</jtitle><addtitle>Metabolites</addtitle><date>2023-11-01</date><risdate>2023</risdate><volume>13</volume><issue>11</issue><spage>1120</spage><pages>1120-</pages><issn>2218-1989</issn><eissn>2218-1989</eissn><abstract>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. 
Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</abstract><cop>Switzerland</cop><pub>MDPI AG</pub><pmid>37999216</pmid><doi>10.3390/metabo13111120</doi><orcidid>https://orcid.org/0000-0003-3995-5368</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2218-1989
ispartof Metabolites, 2023-11, Vol.13 (11), p.1120
issn 2218-1989
2218-1989
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6
source PubMed Central Free; Publicly Available Content Database
subjects Analysis
Classification
Datasets
Genomes
Hydrogen
Hypotheses
Information management
KEGG
kegg_pull
Learning algorithms
Machine learning
md_harmonize
Metabolic pathways
Metabolism
metabolite
Metabolites
Metabolomics
Molecular structure
NMR
Nuclear magnetic resonance
pathway
title Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T14%3A54%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Benchmark%20Dataset%20for%20Training%20Machine%20Learning%20Models%20to%20Predict%20the%20Pathway%20Involvement%20of%20Metabolites&rft.jtitle=Metabolites&rft.au=Huckvale,%20Erik%20D&rft.date=2023-11-01&rft.volume=13&rft.issue=11&rft.spage=1120&rft.pages=1120-&rft.issn=2218-1989&rft.eissn=2218-1989&rft_id=info:doi/10.3390/metabo13111120&rft_dat=%3Cgale_doaj_%3EA774323428%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2893121419&rft_id=info:pmid/37999216&rft_galeid=A774323428&rfr_iscdi=true