TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions
Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs...
Published in: | Journal of medicinal chemistry 2022-06, Vol.65 (11), p.7918-7932 |
---|---|
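Main Authors: | Zhang, Xujun; Shen, Chao; Liao, Ben; Jiang, Dejun; Wang, Jike; Wu, Zhenxing; Du, Hongyan; Wang, Tianyue; Huo, Wenbo; Xu, Lei; Cao, Dongsheng; Hsieh, Chang-Yu; Hou, Tingjun |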
Format: | Article |
Language: | English |
Subjects: | Benchmarking; Ligands; Machine Learning; Molecular Conformation |
cited_by | cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63 |
---|---|
cites | cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63 |
container_end_page | 7932 |
container_issue | 11 |
container_start_page | 7918 |
container_title | Journal of medicinal chemistry |
container_volume | 65 |
creator | Zhang, Xujun; Shen, Chao; Liao, Ben; Jiang, Dejun; Wang, Jike; Wu, Zhenxing; Du, Hongyan; Wang, Tianyue; Huo, Wenbo; Xu, Lei; Cao, Dongsheng; Hsieh, Chang-Yu; Hou, Tingjun |
description | Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs. |
doi_str_mv | 10.1021/acs.jmedchem.2c00460 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2672319098</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2672319098</sourcerecordid><originalsourceid>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</originalsourceid><addsrcrecordid>eNp9kEtPwzAQhC0EouXxDxDykUvK2k6dhFuhvKQCB8o5cu1Nm9LaxU5A_Htc2nLkNNLuzK7mI-SMQY8BZ5dKh958iUbPcNnjGiCVsEe6rM8hSXNI90kXgPOESy465CiEOQAIxsUh6Yi-THmWZV3yOXbaDVG77ys6oM_4RQerlXdKz2jj6BBDPbX0zU5qFdDQoWqiNoFWztOxV7Wt7ZQqa-g1Wj1bKv--HjzFeG0xGaHyv45X7fxa71qrm9rZcEIOKrUIeLrVY_J2dzu-eUhGL_ePN4NRokSaN4mpRD_XPMtZwQymLBO5MrGpQQCpIAfsV4BaZQYNSMNZIQ1mkheikmkxkeKYXGzuxk4fLYamXNZB42KhLLo2lFxmXLACijxa041VexeCx6pc-To2-i4ZlGviZSRe7oiXW-Ixdr790E7i7i-0QxwNsDH8xl3rbSz8_80faGOQaQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2672319098</pqid></control><display><type>article</type><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><source>American Chemical Society:Jisc Collections:American Chemical Society Read & Publish Agreement 2022-2024 (Reading list)</source><creator>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</creator><creatorcontrib>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</creatorcontrib><description>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. 
However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</description><identifier>ISSN: 0022-2623</identifier><identifier>EISSN: 1520-4804</identifier><identifier>DOI: 10.1021/acs.jmedchem.2c00460</identifier><identifier>PMID: 35642777</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Benchmarking ; Ligands ; Machine Learning ; Molecular Conformation</subject><ispartof>Journal of medicinal chemistry, 2022-06, Vol.65 (11), p.7918-7932</ispartof><rights>2022 American Chemical Society</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</citedby><cites>FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</cites><orcidid>0000-0001-7227-2580 ; 0000-0002-2035-5074 ; 0000-0003-2783-5529 ; 
0000-0003-3604-3785</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/35642777$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Zhang, Xujun</creatorcontrib><creatorcontrib>Shen, Chao</creatorcontrib><creatorcontrib>Liao, Ben</creatorcontrib><creatorcontrib>Jiang, Dejun</creatorcontrib><creatorcontrib>Wang, Jike</creatorcontrib><creatorcontrib>Wu, Zhenxing</creatorcontrib><creatorcontrib>Du, Hongyan</creatorcontrib><creatorcontrib>Wang, Tianyue</creatorcontrib><creatorcontrib>Huo, Wenbo</creatorcontrib><creatorcontrib>Xu, Lei</creatorcontrib><creatorcontrib>Cao, Dongsheng</creatorcontrib><creatorcontrib>Hsieh, Chang-Yu</creatorcontrib><creatorcontrib>Hou, Tingjun</creatorcontrib><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><title>Journal of medicinal chemistry</title><addtitle>J. Med. Chem</addtitle><description>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. 
The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</description><subject>Benchmarking</subject><subject>Ligands</subject><subject>Machine Learning</subject><subject>Molecular Conformation</subject><issn>0022-2623</issn><issn>1520-4804</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNp9kEtPwzAQhC0EouXxDxDykUvK2k6dhFuhvKQCB8o5cu1Nm9LaxU5A_Htc2nLkNNLuzK7mI-SMQY8BZ5dKh958iUbPcNnjGiCVsEe6rM8hSXNI90kXgPOESy465CiEOQAIxsUh6Yi-THmWZV3yOXbaDVG77ys6oM_4RQerlXdKz2jj6BBDPbX0zU5qFdDQoWqiNoFWztOxV7Wt7ZQqa-g1Wj1bKv--HjzFeG0xGaHyv45X7fxa71qrm9rZcEIOKrUIeLrVY_J2dzu-eUhGL_ePN4NRokSaN4mpRD_XPMtZwQymLBO5MrGpQQCpIAfsV4BaZQYNSMNZIQ1mkheikmkxkeKYXGzuxk4fLYamXNZB42KhLLo2lFxmXLACijxa041VexeCx6pc-To2-i4ZlGviZSRe7oiXW-Ixdr790E7i7i-0QxwNsDH8xl3rbSz8_80faGOQaQ</recordid><startdate>20220609</startdate><enddate>20220609</enddate><creator>Zhang, Xujun</creator><creator>Shen, Chao</creator><creator>Liao, Ben</creator><creator>Jiang, Dejun</creator><creator>Wang, Jike</creator><creator>Wu, Zhenxing</creator><creator>Du, Hongyan</creator><creator>Wang, Tianyue</creator><creator>Huo, Wenbo</creator><creator>Xu, Lei</creator><creator>Cao, Dongsheng</creator><creator>Hsieh, Chang-Yu</creator><creator>Hou, Tingjun</creator><general>American Chemical 
Society</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-7227-2580</orcidid><orcidid>https://orcid.org/0000-0002-2035-5074</orcidid><orcidid>https://orcid.org/0000-0003-2783-5529</orcidid><orcidid>https://orcid.org/0000-0003-3604-3785</orcidid></search><sort><creationdate>20220609</creationdate><title>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</title><author>Zhang, Xujun ; Shen, Chao ; Liao, Ben ; Jiang, Dejun ; Wang, Jike ; Wu, Zhenxing ; Du, Hongyan ; Wang, Tianyue ; Huo, Wenbo ; Xu, Lei ; Cao, Dongsheng ; Hsieh, Chang-Yu ; Hou, Tingjun</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Benchmarking</topic><topic>Ligands</topic><topic>Machine Learning</topic><topic>Molecular Conformation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhang, Xujun</creatorcontrib><creatorcontrib>Shen, Chao</creatorcontrib><creatorcontrib>Liao, Ben</creatorcontrib><creatorcontrib>Jiang, Dejun</creatorcontrib><creatorcontrib>Wang, Jike</creatorcontrib><creatorcontrib>Wu, Zhenxing</creatorcontrib><creatorcontrib>Du, Hongyan</creatorcontrib><creatorcontrib>Wang, Tianyue</creatorcontrib><creatorcontrib>Huo, Wenbo</creatorcontrib><creatorcontrib>Xu, Lei</creatorcontrib><creatorcontrib>Cao, Dongsheng</creatorcontrib><creatorcontrib>Hsieh, Chang-Yu</creatorcontrib><creatorcontrib>Hou, Tingjun</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of medicinal chemistry</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhang, Xujun</au><au>Shen, Chao</au><au>Liao, Ben</au><au>Jiang, Dejun</au><au>Wang, Jike</au><au>Wu, Zhenxing</au><au>Du, Hongyan</au><au>Wang, Tianyue</au><au>Huo, Wenbo</au><au>Xu, Lei</au><au>Cao, Dongsheng</au><au>Hsieh, Chang-Yu</au><au>Hou, Tingjun</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions</atitle><jtitle>Journal of medicinal chemistry</jtitle><addtitle>J. Med. Chem</addtitle><date>2022-06-09</date><risdate>2022</risdate><volume>65</volume><issue>11</issue><spage>7918</spage><epage>7932</epage><pages>7918-7932</pages><issn>0022-2623</issn><eissn>1520-4804</eissn><abstract>Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. 
The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>35642777</pmid><doi>10.1021/acs.jmedchem.2c00460</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0001-7227-2580</orcidid><orcidid>https://orcid.org/0000-0002-2035-5074</orcidid><orcidid>https://orcid.org/0000-0003-2783-5529</orcidid><orcidid>https://orcid.org/0000-0003-3604-3785</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0022-2623 |
ispartof | Journal of medicinal chemistry, 2022-06, Vol.65 (11), p.7918-7932 |
issn | 0022-2623; 1520-4804 |
language | eng |
recordid | cdi_proquest_miscellaneous_2672319098 |
source | American Chemical Society:Jisc Collections:American Chemical Society Read & Publish Agreement 2022-2024 (Reading list) |
subjects | Benchmarking; Ligands; Machine Learning; Molecular Conformation |
title | TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T20%3A37%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TocoDecoy:%20A%20New%20Approach%20to%20Design%20Unbiased%20Datasets%20for%20Training%20and%20Benchmarking%20Machine-Learning%20Scoring%20Functions&rft.jtitle=Journal%20of%20medicinal%20chemistry&rft.au=Zhang,%20Xujun&rft.date=2022-06-09&rft.volume=65&rft.issue=11&rft.spage=7918&rft.epage=7932&rft.pages=7918-7932&rft.issn=0022-2623&rft.eissn=1520-4804&rft_id=info:doi/10.1021/acs.jmedchem.2c00460&rft_dat=%3Cproquest_cross%3E2672319098%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a348t-df358c278191de41738ad004de006a080e5f0eca7ded06d2196de76293f649b63%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2672319098&rft_id=info:pmid/35642777&rfr_iscdi=true |