TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions

Bibliographic Details
Published in: Journal of medicinal chemistry 2022-06, Vol.65 (11), p.7918-7932
Main Authors: Zhang, Xujun, Shen, Chao, Liao, Ben, Jiang, Dejun, Wang, Jike, Wu, Zhenxing, Du, Hongyan, Wang, Tianyue, Huo, Wenbo, Xu, Lei, Cao, Dongsheng, Hsieh, Chang-Yu, Hou, Tingjun
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63
cites cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63
container_end_page 7932
container_issue 11
container_start_page 7918
container_title Journal of medicinal chemistry
container_volume 65
creator Zhang, Xujun
Shen, Chao
Liao, Ben
Jiang, Dejun
Wang, Jike
Wu, Zhenxing
Du, Hongyan
Wang, Tianyue
Huo, Wenbo
Xu, Lei
Cao, Dongsheng
Hsieh, Chang-Yu
Hou, Tingjun
description Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.
doi_str_mv 10.1021/acs.jmedchem.2c00460
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2672319098</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2672319098</sourcerecordid><originalsourceid>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</originalsourceid><addsrcrecordid>eNp9kEtPwzAQhC0EouXxDxDykUvK2k6dhFuhvKQCB8o5cu1Nm9LaxU5A_Htc2nLkNNLuzK7mI-SMQY8BZ5dKh958iUbPcNnjGiCVsEe6rM8hSXNI90kXgPOESy465CiEOQAIxsUh6Yi-THmWZV3yOXbaDVG77ys6oM_4RQerlXdKz2jj6BBDPbX0zU5qFdDQoWqiNoFWztOxV7Wt7ZQqa-g1Wj1bKv--HjzFeG0xGaHyv45X7fxa71qrm9rZcEIOKrUIeLrVY_J2dzu-eUhGL_ePN4NRokSaN4mpRD_XPMtZwQymLBO5MrGpQQCpIAfsV4BaZQYNSMNZIQ1mkheikmkxkeKYXGzuxk4fLYamXNZB42KhLLo2lFxmXLACijxa041VexeCx6pc-To2-i4ZlGviZSRe7oiXW-Ixdr790E7i7i-0QxwNsDH8xl3rbSz8_80faGOQaQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2672319098</pqid></control><display><type>article</type><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><source>American Chemical Society:Jisc Collections:American Chemical Society Read &amp; Publish Agreement 2022-2024 (Reading list)</source><creator>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</creator><creatorcontrib>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</creatorcontrib><description>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. 
However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</description><identifier>ISSN: 0022-2623</identifier><identifier>EISSN: 1520-4804</identifier><identifier>DOI: 10.1021/acs.jmedchem.2c00460</identifier><identifier>PMID: 35642777</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Benchmarking ; Ligands ; Machine Learning ; Molecular Conformation</subject><ispartof>Journal of medicinal chemistry, 2022-06, Vol.65 (11), p.7918-7932</ispartof><rights>2022 American Chemical Society</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</citedby><cites>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</cites><orcidid>0000-0001-7227-2580 ; 0000-0002-2035-5074 ; 0000-0003-2783-5529 ; 
0000-0003-3604-3785</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/35642777$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Zhang, Xujun</creatorcontrib><creatorcontrib>Shen, Chao</creatorcontrib><creatorcontrib>Liao, Ben</creatorcontrib><creatorcontrib>Jiang, Dejun</creatorcontrib><creatorcontrib>Wang, Jike</creatorcontrib><creatorcontrib>Wu, Zhenxing</creatorcontrib><creatorcontrib>Du, Hongyan</creatorcontrib><creatorcontrib>Wang, Tianyue</creatorcontrib><creatorcontrib>Huo, Wenbo</creatorcontrib><creatorcontrib>Xu, Lei</creatorcontrib><creatorcontrib>Cao, Dongsheng</creatorcontrib><creatorcontrib>Hsieh, Chang-Yu</creatorcontrib><creatorcontrib>Hou, Tingjun</creatorcontrib><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><title>Journal of medicinal chemistry</title><addtitle>J. Med. Chem</addtitle><description>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. 
The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</description><subject>Benchmarking</subject><subject>Ligands</subject><subject>Machine Learning</subject><subject>Molecular Conformation</subject><issn>0022-2623</issn><issn>1520-4804</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNp9kEtPwzAQhC0EouXxDxDykUvK2k6dhFuhvKQCB8o5cu1Nm9LaxU5A_Htc2nLkNNLuzK7mI-SMQY8BZ5dKh958iUbPcNnjGiCVsEe6rM8hSXNI90kXgPOESy465CiEOQAIxsUh6Yi-THmWZV3yOXbaDVG77ys6oM_4RQerlXdKz2jj6BBDPbX0zU5qFdDQoWqiNoFWztOxV7Wt7ZQqa-g1Wj1bKv--HjzFeG0xGaHyv45X7fxa71qrm9rZcEIOKrUIeLrVY_J2dzu-eUhGL_ePN4NRokSaN4mpRD_XPMtZwQymLBO5MrGpQQCpIAfsV4BaZQYNSMNZIQ1mkheikmkxkeKYXGzuxk4fLYamXNZB42KhLLo2lFxmXLACijxa041VexeCx6pc-To2-i4ZlGviZSRe7oiXW-Ixdr790E7i7i-0QxwNsDH8xl3rbSz8_80faGOQaQ</recordid><startdate>20220609</startdate><enddate>20220609</enddate><creator>Zhang, Xujun</creator><creator>Shen, Chao</creator><creator>Liao, Ben</creator><creator>Jiang, Dejun</creator><creator>Wang, Jike</creator><creator>Wu, Zhenxing</creator><creator>Du, Hongyan</creator><creator>Wang, Tianyue</creator><creator>Huo, Wenbo</creator><creator>Xu, Lei</creator><creator>Cao, Dongsheng</creator><creator>Hsieh, Chang-Yu</creator><creator>Hou, Tingjun</creator><general>American Chemical 
Society</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-7227-2580</orcidid><orcidid>https://orcid.org/0000-0002-2035-5074</orcidid><orcidid>https://orcid.org/0000-0003-2783-5529</orcidid><orcidid>https://orcid.org/0000-0003-3604-3785</orcidid></search><sort><creationdate>20220609</creationdate><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><author>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Benchmarking</topic><topic>Ligands</topic><topic>Machine Learning</topic><topic>Molecular Conformation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhang, Xujun</creatorcontrib><creatorcontrib>Shen, Chao</creatorcontrib><creatorcontrib>Liao, Ben</creatorcontrib><creatorcontrib>Jiang, Dejun</creatorcontrib><creatorcontrib>Wang, Jike</creatorcontrib><creatorcontrib>Wu, Zhenxing</creatorcontrib><creatorcontrib>Du, Hongyan</creatorcontrib><creatorcontrib>Wang, Tianyue</creatorcontrib><creatorcontrib>Huo, Wenbo</creatorcontrib><creatorcontrib>Xu, Lei</creatorcontrib><creatorcontrib>Cao, Dongsheng</creatorcontrib><creatorcontrib>Hsieh, Chang-Yu</creatorcontrib><creatorcontrib>Hou, Tingjun</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of medicinal chemistry</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhang, Xujun</au><au>Shen, Chao</au><au>Liao, Ben</au><au>Jiang, Dejun</au><au>Wang, Jike</au><au>Wu, Zhenxing</au><au>Du, Hongyan</au><au>Wang, Tianyue</au><au>Huo, Wenbo</au><au>Xu, Lei</au><au>Cao, Dongsheng</au><au>Hsieh, Chang-Yu</au><au>Hou, Tingjun</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</atitle><jtitle>Journal of medicinal chemistry</jtitle><addtitle>J. Med. Chem</addtitle><date>2022-06-09</date><risdate>2022</risdate><volume>65</volume><issue>11</issue><spage>7918</spage><epage>7932</epage><pages>7918-7932</pages><issn>0022-2623</issn><eissn>1520-4804</eissn><abstract>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. 
The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>35642777</pmid><doi>10.1021/acs.jmedchem.2c00460</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0001-7227-2580</orcidid><orcidid>https://orcid.org/0000-0002-2035-5074</orcidid><orcidid>https://orcid.org/0000-0003-2783-5529</orcidid><orcidid>https://orcid.org/0000-0003-3604-3785</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 0022-2623
ispartof Journal of medicinal chemistry, 2022-06, Vol.65 (11), p.7918-7932
issn 0022-2623
1520-4804
language eng
recordid cdi_proquest_miscellaneous_2672319098
source American Chemical Society:Jisc Collections:American Chemical Society Read & Publish Agreement 2022-2024 (Reading list)
subjects Benchmarking
Ligands
Machine Learning
Molecular Conformation
title TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T20%3A37%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TocoDecoy:%20A%20New%20Approach%20to%20Design%20Unbiased%20Datasets%20for%20Training%20and%20Benchmarking%20Machine-Learning%20Scoring%20Functions&rft.jtitle=Journal%20of%20medicinal%20chemistry&rft.au=Zhang,%20Xujun&rft.date=2022-06-09&rft.volume=65&rft.issue=11&rft.spage=7918&rft.epage=7932&rft.pages=7918-7932&rft.issn=0022-2623&rft.eissn=1520-4804&rft_id=info:doi/10.1021/acs.jmedchem.2c00460&rft_dat=%3Cproquest_cross%3E2672319098%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2672319098&rft_id=info:pmid/35642777&rfr_iscdi=true