Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting...

Bibliographic Details
Published in: Digital discovery 2022-08, Vol.1 (4), p.490-501
Main Authors: Barnabas, Shadrack J., Böhme, Timo, Boyer, Stephen K., Irmer, Matthias, Ruttkies, Christoph, Wetherbee, Ian, Kondić, Todor, Schymanski, Emma L., Weber, Lutz
Format: Article
Language: English
container_end_page 501
container_issue 4
container_start_page 490
container_title Digital discovery
container_volume 1
creator Barnabas, Shadrack J.
Böhme, Timo
Boyer, Stephen K.
Irmer, Matthias
Ruttkies, Christoph
Wetherbee, Ian
Kondić, Todor
Schymanski, Emma L.
Weber, Lutz
description The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study to perform the substructure search, due to their high environmental relevance, their presence in both literature and patent corpuses, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of various definitions from a cheminformatics perspective. Since CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, different numbers of parsed compounds were extracted, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (∼1.7 M PFAS from patents and ∼27 K from CORE) were compared against various existing PFAS lists and are provided in various formats for further community research efforts.
doi_str_mv 10.1039/D2DD00019A
format article
identifier ISSN: 2635-098X
ispartof Digital discovery, 2022-08, Vol.1 (4), p.490-501
issn 2635-098X
2635-098X
language eng
recordid cdi_crossref_primary_10_1039_D2DD00019A
source Alma/SFX Local Collection
title Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T20%3A11%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Extraction%20of%20chemical%20structures%20from%20literature%20and%20patent%20documents%20using%20open%20access%20chemistry%20toolkits:%20a%20case%20study%20with%20PFAS&rft.jtitle=Digital%20discovery&rft.au=Barnabas,%20Shadrack%20J.&rft.date=2022-08-08&rft.volume=1&rft.issue=4&rft.spage=490&rft.epage=501&rft.pages=490-501&rft.issn=2635-098X&rft.eissn=2635-098X&rft_id=info:doi/10.1039/D2DD00019A&rft_dat=%3Ccrossref%3E10_1039_D2DD00019A%3C/crossref%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c289t-884e3b83d307a4aae479e3e1f19834c74afd8dde02553c57717bf431138eb8973%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true