Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites
Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address...
Published in: | Metabolites 2023-11, Vol.13 (11), p.1120 |
---|---|
Main Authors: | Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B |
Format: | Article |
Language: | English |
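Subjects: | Analysis; Classification; Datasets; Genomes; Hydrogen; Hypotheses; Information management; KEGG; kegg_pull; Learning algorithms; Machine learning; md_harmonize; Metabolic pathways; Metabolism; metabolite; Metabolites; Metabolomics; Molecular structure; NMR; Nuclear magnetic resonance; pathway |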
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3 |
container_end_page | |
container_issue | 11 |
container_start_page | 1120 |
container_title | Metabolites |
container_volume | 13 |
creator | Huckvale, Erik D; Powell, Christian D; Jin, Huan; Moseley, Hunter N B |
description | Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address this shortcoming, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions. These prior models, however, are based on older KEGG-derived metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted-average performance across 1000 unique folds, an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, was achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories. |
doi_str_mv | 10.3390/metabo13111120 |
format | article |
fullrecord | <record><control><sourceid>gale_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A774323428</galeid><doaj_id>oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6</doaj_id><sourcerecordid>A774323428</sourcerecordid><originalsourceid>FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</originalsourceid><addsrcrecordid>eNpdUk1v1DAQjRCIVkuvHJElLly2eGzHsY-lfK20FT2UczRxJrtekrjY3qL-e8xuKR_jg62nN2_ek6eqXgI_l9LytxNl7AJIKCX4k-pUCDBLsMY-_et9Up2ltOOlNK8bDs-rE9lYawXo02r7jma3nTB-Y-8xY6LMhhDZTUQ_-3nDrtBt_UxsTRiPQOhpTCwHdh2p9y6zvCV2jXn7A-_Zar4L4x1NNGcWBnZ1MDj6TOlF9WzAMdHZw72ovn78cHP5ebn-8ml1ebFeOiVkXooBukFxq4yuiajn2HXCQe2cMZJ6oazTGqzEmssaei0B5QBacysQBkNyUa2Oun3AXXsbfcl23wb07QEIcdNizN6N1NZKdaoWjTOdVqrpDIfGNmSh57IB1EXrzVHrNobve0q5nXxyNI44U9inVhgrjTQWmkJ9_R91F_ZxLkkPLBCgiutFdX5kbbDM9_MQckRXTk-Td2GmwRf8ommUFFIJ86fBxZBSpOExEfD21w60_-5AaXj14GPfTdQ_0n__uPwJ8ROq7g</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2893121419</pqid></control><display><type>article</type><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><source>PubMed Central Free</source><source>Publicly Available Content Database</source><creator>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N B</creator><creatorcontrib>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N B</creatorcontrib><description>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. 
To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</description><identifier>ISSN: 2218-1989</identifier><identifier>EISSN: 2218-1989</identifier><identifier>DOI: 10.3390/metabo13111120</identifier><identifier>PMID: 37999216</identifier><language>eng</language><publisher>Switzerland: MDPI AG</publisher><subject>Analysis ; Classification ; Datasets ; Genomes ; Hydrogen ; Hypotheses ; Information management ; KEGG ; kegg_pull ; Learning algorithms ; Machine learning ; md_harmonize ; Metabolic pathways ; Metabolism ; metabolite ; Metabolites ; Metabolomics ; Molecular structure ; NMR ; Nuclear magnetic resonance ; pathway</subject><ispartof>Metabolites, 2023-11, Vol.13 (11), p.1120</ispartof><rights>COPYRIGHT 2023 MDPI AG</rights><rights>2023 by the authors. Licensee MDPI, Basel, Switzerland. 
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</cites><orcidid>0000-0003-3995-5368</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2893121419/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2893121419?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,25753,27924,27925,37012,37013,44590,75126</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37999216$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Huckvale, Erik D</creatorcontrib><creatorcontrib>Powell, Christian D</creatorcontrib><creatorcontrib>Jin, Huan</creatorcontrib><creatorcontrib>Moseley, Hunter N B</creatorcontrib><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><title>Metabolites</title><addtitle>Metabolites</addtitle><description>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. 
To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</description><subject>Analysis</subject><subject>Classification</subject><subject>Datasets</subject><subject>Genomes</subject><subject>Hydrogen</subject><subject>Hypotheses</subject><subject>Information management</subject><subject>KEGG</subject><subject>kegg_pull</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>md_harmonize</subject><subject>Metabolic pathways</subject><subject>Metabolism</subject><subject>metabolite</subject><subject>Metabolites</subject><subject>Metabolomics</subject><subject>Molecular structure</subject><subject>NMR</subject><subject>Nuclear magnetic 
resonance</subject><subject>pathway</subject><issn>2218-1989</issn><issn>2218-1989</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNpdUk1v1DAQjRCIVkuvHJElLly2eGzHsY-lfK20FT2UczRxJrtekrjY3qL-e8xuKR_jg62nN2_ek6eqXgI_l9LytxNl7AJIKCX4k-pUCDBLsMY-_et9Up2ltOOlNK8bDs-rE9lYawXo02r7jma3nTB-Y-8xY6LMhhDZTUQ_-3nDrtBt_UxsTRiPQOhpTCwHdh2p9y6zvCV2jXn7A-_Zar4L4x1NNGcWBnZ1MDj6TOlF9WzAMdHZw72ovn78cHP5ebn-8ml1ebFeOiVkXooBukFxq4yuiajn2HXCQe2cMZJ6oazTGqzEmssaei0B5QBacysQBkNyUa2Oun3AXXsbfcl23wb07QEIcdNizN6N1NZKdaoWjTOdVqrpDIfGNmSh57IB1EXrzVHrNobve0q5nXxyNI44U9inVhgrjTQWmkJ9_R91F_ZxLkkPLBCgiutFdX5kbbDM9_MQckRXTk-Td2GmwRf8ommUFFIJ86fBxZBSpOExEfD21w60_-5AaXj14GPfTdQ_0n__uPwJ8ROq7g</recordid><startdate>20231101</startdate><enddate>20231101</enddate><creator>Huckvale, Erik D</creator><creator>Powell, Christian D</creator><creator>Jin, Huan</creator><creator>Moseley, Hunter N B</creator><general>MDPI AG</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QR</scope><scope>8FD</scope><scope>8FE</scope><scope>8FH</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>LK8</scope><scope>M7P</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-3995-5368</orcidid></search><sort><creationdate>20231101</creationdate><title>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</title><author>Huckvale, Erik D ; Powell, Christian D ; Jin, Huan ; Moseley, Hunter N 
B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Analysis</topic><topic>Classification</topic><topic>Datasets</topic><topic>Genomes</topic><topic>Hydrogen</topic><topic>Hypotheses</topic><topic>Information management</topic><topic>KEGG</topic><topic>kegg_pull</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>md_harmonize</topic><topic>Metabolic pathways</topic><topic>Metabolism</topic><topic>metabolite</topic><topic>Metabolites</topic><topic>Metabolomics</topic><topic>Molecular structure</topic><topic>NMR</topic><topic>Nuclear magnetic resonance</topic><topic>pathway</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Huckvale, Erik D</creatorcontrib><creatorcontrib>Powell, Christian D</creatorcontrib><creatorcontrib>Jin, Huan</creatorcontrib><creatorcontrib>Moseley, Hunter N B</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Chemoreception Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Biological Science 
Collection</collection><collection>Biological Science Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>Directory of Open Access Journals</collection><jtitle>Metabolites</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Huckvale, Erik D</au><au>Powell, Christian D</au><au>Jin, Huan</au><au>Moseley, Hunter N B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites</atitle><jtitle>Metabolites</jtitle><addtitle>Metabolites</addtitle><date>2023-11-01</date><risdate>2023</risdate><volume>13</volume><issue>11</issue><spage>1120</spage><pages>1120-</pages><issn>2218-1989</issn><eissn>2218-1989</eissn><abstract>Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. 
Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.</abstract><cop>Switzerland</cop><pub>MDPI AG</pub><pmid>37999216</pmid><doi>10.3390/metabo13111120</doi><orcidid>https://orcid.org/0000-0003-3995-5368</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2218-1989 |
ispartof | Metabolites, 2023-11, Vol.13 (11), p.1120 |
issn | 2218-1989 2218-1989 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_544b4527c8b6447b801797e91d0371a6 |
source | PubMed Central Free; Publicly Available Content Database |
subjects | Analysis; Classification; Datasets; Genomes; Hydrogen; Hypotheses; Information management; KEGG; kegg_pull; Learning algorithms; Machine learning; md_harmonize; Metabolic pathways; Metabolism; metabolite; Metabolites; Metabolomics; Molecular structure; NMR; Nuclear magnetic resonance; pathway |
title | Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T14%3A54%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Benchmark%20Dataset%20for%20Training%20Machine%20Learning%20Models%20to%20Predict%20the%20Pathway%20Involvement%20of%20Metabolites&rft.jtitle=Metabolites&rft.au=Huckvale,%20Erik%20D&rft.date=2023-11-01&rft.volume=13&rft.issue=11&rft.spage=1120&rft.pages=1120-&rft.issn=2218-1989&rft.eissn=2218-1989&rft_id=info:doi/10.3390/metabo13111120&rft_dat=%3Cgale_doaj_%3EA774323428%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c423t-2f1bf4094865eeed0abb2c15cc883ed249c66193a50351d631a3f166092a1f8e3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2893121419&rft_id=info:pmid/37999216&rft_galeid=A774323428&rfr_iscdi=true |