
Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction

Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets on the out...

Bibliographic Details
Published in: Journal of chemical information and modeling, 2024-03, Vol. 64 (6), p. 1955-1965
Main Authors: Gorantla, Rohan; Kubincová, Alžbeta; Suutari, Benjamin; Cossins, Benjamin P.; Mey, Antonia S. J. S.
Format: Article
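Language: English
Subjects: Active learning; Affinity; Benchmarking; Datasets; Drug Discovery - methods; Gaussian process; Impact prediction; Ligands; Machine Learning; Machine Learning and Deep Learning; Noise prediction; Random noise; Recall; Software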
cited_by cdi_FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213
cites cdi_FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213
container_end_page 1965
container_issue 6
container_start_page 1955
container_title Journal of chemical information and modeling
container_volume 64
creator Gorantla, Rohan
Kubincová, Alžbeta
Suutari, Benjamin
Cossins, Benjamin P.
Mey, Antonia S. J. S.
description Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets on the outcomes. We use four affinity data sets for different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate the performance of machine learning models [Gaussian process (GP) model and Chemprop model], sample selection protocols, and the batch size based on metrics describing the overall predictive power of the model (R2, Spearman rank, root-mean-square error) as well as the accurate identification of top 2%/5% binders (Recall, F1 score). Both models have a comparable Recall of top binders on large data sets, but the GP model surpasses the Chemprop model when training data are sparse. A larger initial batch size, especially on diverse data sets, increased the Recall of both models as well as overall correlation metrics. However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds proved to be desirable. Furthermore, adding artificial Gaussian noise to the data up to a certain threshold still allowed the model to identify clusters with top-scoring compounds. However, excessive noise (<1σ) did impact the model’s predictive and exploitative capabilities.
doi_str_mv 10.1021/acs.jcim.4c00220
format article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10966646</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2986181401</sourcerecordid><originalsourceid>FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213</originalsourceid><addsrcrecordid>eNp1kUtLAzEUhYMoVqt7V1Jw48KpeU3MrKQtvqCiCwV3IZPJtKnTpCYzhf570ycquMoN-c7JuRwAzhDsIojRtVShO1Fm2qUKQozhHjhCKc2SjMGP_e2cZqwFjkOYQEhIxvAhaBFOKUMEHYHnvrZqPJX-09hRp6dqM9edoZbeLu-v3tVOuSp0Suc7QzOStkj6xhYruCyNNfUiUrowUensCTgoZRX06eZsg_f7u7fBYzJ8eXga9IaJpAzXSVrkZQE51bLMIeaqTJGmGVcSY06IIrlKqY5DmqdSZiomJYgiSgiN8RlGpA1u176zJp_qQmlbe1mJmTdxk4Vw0ojfL9aMxcjNBYIZY4yy6HC5cfDuq9GhFlMTlK4qabVrgsAZ4ZjfIMgjevEHnbjG27hfpDhDHFG4jATXlPIuBK_LXRoExbIsEcsSy7LEpqwoOf-5xU6wbScCV2tgJd1--q_fNzy_oHY</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2986181401</pqid></control><display><type>article</type><title>Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction</title><source>American Chemical Society:Jisc Collections:American Chemical Society Read &amp; Publish Agreement 2022-2024 (Reading list)</source><creator>Gorantla, Rohan ; Kubincová, Alžbeta ; Suutari, Benjamin ; Cossins, Benjamin P. ; Mey, Antonia S. J. S.</creator><creatorcontrib>Gorantla, Rohan ; Kubincová, Alžbeta ; Suutari, Benjamin ; Cossins, Benjamin P. ; Mey, Antonia S. J. S.</creatorcontrib><description>Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets on the outcomes. 
We use four affinity data sets for different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate the performance of machine learning models [Gaussian process (GP) model and Chemprop model], sample selection protocols, and the batch size based on metrics describing the overall predictive power of the model (R2, Spearman rank, root-mean-square error) as well as the accurate identification of top 2%/5% binders (Recall, F1 score). Both models have a comparable Recall of top binders on large data sets, but the GP model surpasses the Chemprop model when training data are sparse. A larger initial batch size, especially on diverse data sets, increased the Recall of both models as well as overall correlation metrics. However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds proved to be desirable. Furthermore, adding artificial Gaussian noise to the data up to a certain threshold still allowed the model to identify clusters with top-scoring compounds. However, excessive noise (&lt;1σ) did impact the model’s predictive and exploitative capabilities.</description><identifier>ISSN: 1549-9596</identifier><identifier>ISSN: 1549-960X</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.4c00220</identifier><identifier>PMID: 38446131</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Active learning ; Affinity ; Benchmarking ; Datasets ; Drug Discovery - methods ; Gaussian process ; Impact prediction ; Ligands ; Machine Learning ; Machine Learning and Deep Learning ; Noise prediction ; Random noise ; Recall ; Software</subject><ispartof>Journal of chemical information and modeling, 2024-03, Vol.64 (6), p.1955-1965</ispartof><rights>2024 The Authors. Published by American Chemical Society</rights><rights>Copyright American Chemical Society Mar 25, 2024</rights><rights>2024 The Authors. 
Published by American Chemical Society 2024 The Authors</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213</citedby><cites>FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213</cites><orcidid>0000-0001-7512-5252</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,780,784,885,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38446131$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Gorantla, Rohan</creatorcontrib><creatorcontrib>Kubincová, Alžbeta</creatorcontrib><creatorcontrib>Suutari, Benjamin</creatorcontrib><creatorcontrib>Cossins, Benjamin P.</creatorcontrib><creatorcontrib>Mey, Antonia S. J. S.</creatorcontrib><title>Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets on the outcomes. We use four affinity data sets for different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate the performance of machine learning models [Gaussian process (GP) model and Chemprop model], sample selection protocols, and the batch size based on metrics describing the overall predictive power of the model (R2, Spearman rank, root-mean-square error) as well as the accurate identification of top 2%/5% binders (Recall, F1 score). 
Both models have a comparable Recall of top binders on large data sets, but the GP model surpasses the Chemprop model when training data are sparse. A larger initial batch size, especially on diverse data sets, increased the Recall of both models as well as overall correlation metrics. However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds proved to be desirable. Furthermore, adding artificial Gaussian noise to the data up to a certain threshold still allowed the model to identify clusters with top-scoring compounds. However, excessive noise (&lt;1σ) did impact the model’s predictive and exploitative capabilities.</description><subject>Active learning</subject><subject>Affinity</subject><subject>Benchmarking</subject><subject>Datasets</subject><subject>Drug Discovery - methods</subject><subject>Gaussian process</subject><subject>Impact prediction</subject><subject>Ligands</subject><subject>Machine Learning</subject><subject>Machine Learning and Deep Learning</subject><subject>Noise prediction</subject><subject>Random noise</subject><subject>Recall</subject><subject>Software</subject><issn>1549-9596</issn><issn>1549-960X</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNp1kUtLAzEUhYMoVqt7V1Jw48KpeU3MrKQtvqCiCwV3IZPJtKnTpCYzhf570ycquMoN-c7JuRwAzhDsIojRtVShO1Fm2qUKQozhHjhCKc2SjMGP_e2cZqwFjkOYQEhIxvAhaBFOKUMEHYHnvrZqPJX-09hRp6dqM9edoZbeLu-v3tVOuSp0Suc7QzOStkj6xhYruCyNNfUiUrowUensCTgoZRX06eZsg_f7u7fBYzJ8eXga9IaJpAzXSVrkZQE51bLMIeaqTJGmGVcSY06IIrlKqY5DmqdSZiomJYgiSgiN8RlGpA1u176zJp_qQmlbe1mJmTdxk4Vw0ojfL9aMxcjNBYIZY4yy6HC5cfDuq9GhFlMTlK4qabVrgsAZ4ZjfIMgjevEHnbjG27hfpDhDHFG4jATXlPIuBK_LXRoExbIsEcsSy7LEpqwoOf-5xU6wbScCV2tgJd1--q_fNzy_oHY</recordid><startdate>20240325</startdate><enddate>20240325</enddate><creator>Gorantla, Rohan</creator><creator>Kubincová, Alžbeta</creator><creator>Suutari, Benjamin</creator><creator>Cossins, Benjamin 
P.</creator><creator>Mey, Antonia S. J. S.</creator><general>American Chemical Society</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-7512-5252</orcidid></search><sort><creationdate>20240325</creationdate><title>Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction</title><author>Gorantla, Rohan ; Kubincová, Alžbeta ; Suutari, Benjamin ; Cossins, Benjamin P. ; Mey, Antonia S. J. S.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Active learning</topic><topic>Affinity</topic><topic>Benchmarking</topic><topic>Datasets</topic><topic>Drug Discovery - methods</topic><topic>Gaussian process</topic><topic>Impact prediction</topic><topic>Ligands</topic><topic>Machine Learning</topic><topic>Machine Learning and Deep Learning</topic><topic>Noise prediction</topic><topic>Random noise</topic><topic>Recall</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Gorantla, Rohan</creatorcontrib><creatorcontrib>Kubincová, Alžbeta</creatorcontrib><creatorcontrib>Suutari, Benjamin</creatorcontrib><creatorcontrib>Cossins, Benjamin P.</creatorcontrib><creatorcontrib>Mey, Antonia S. J. 
S.</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gorantla, Rohan</au><au>Kubincová, Alžbeta</au><au>Suutari, Benjamin</au><au>Cossins, Benjamin P.</au><au>Mey, Antonia S. J. S.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. Model</addtitle><date>2024-03-25</date><risdate>2024</risdate><volume>64</volume><issue>6</issue><spage>1955</spage><epage>1965</epage><pages>1955-1965</pages><issn>1549-9596</issn><issn>1549-960X</issn><eissn>1549-960X</eissn><abstract>Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. 
To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets on the outcomes. We use four affinity data sets for different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate the performance of machine learning models [Gaussian process (GP) model and Chemprop model], sample selection protocols, and the batch size based on metrics describing the overall predictive power of the model (R2, Spearman rank, root-mean-square error) as well as the accurate identification of top 2%/5% binders (Recall, F1 score). Both models have a comparable Recall of top binders on large data sets, but the GP model surpasses the Chemprop model when training data are sparse. A larger initial batch size, especially on diverse data sets, increased the Recall of both models as well as overall correlation metrics. However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds proved to be desirable. Furthermore, adding artificial Gaussian noise to the data up to a certain threshold still allowed the model to identify clusters with top-scoring compounds. However, excessive noise (&lt;1σ) did impact the model’s predictive and exploitative capabilities.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>38446131</pmid><doi>10.1021/acs.jcim.4c00220</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0001-7512-5252</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1549-9596
ispartof Journal of chemical information and modeling, 2024-03, Vol.64 (6), p.1955-1965
issn 1549-9596
1549-960X
1549-960X
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10966646
source American Chemical Society:Jisc Collections:American Chemical Society Read & Publish Agreement 2022-2024 (Reading list)
subjects Active learning
Affinity
Benchmarking
Datasets
Drug Discovery - methods
Gaussian process
Impact prediction
Ligands
Machine Learning
Machine Learning and Deep Learning
Noise prediction
Random noise
Recall
Software
title Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T16%3A07%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Benchmarking%20Active%20Learning%20Protocols%20for%20Ligand-Binding%20Affinity%20Prediction&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Gorantla,%20Rohan&rft.date=2024-03-25&rft.volume=64&rft.issue=6&rft.spage=1955&rft.epage=1965&rft.pages=1955-1965&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.4c00220&rft_dat=%3Cproquest_pubme%3E2986181401%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a462t-5dbfd084eafb028cf51e498ca22833c3bc54e33c5b5aa9c613314143349626213%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2986181401&rft_id=info:pmid/38446131&rfr_iscdi=true