Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models

Accurately predicting the toxicity of industrial solvents used in perovskite synthesis is necessary but is limited by a lack of targeted, structured toxicity data. This paper presents a novel framework that combines automated data extraction using language models,...

Bibliographic Details
Published in: arXiv.org 2024-09
Main Authors: Mukherjee, Arpan; Giri, Deepesh; Rajan, Krishna
Format: Article
Language: English
container_title arXiv.org
creator Mukherjee, Arpan
Giri, Deepesh
Rajan, Krishna
description Accurately predicting the toxicity of industrial solvents used in perovskite synthesis is necessary but is limited by a lack of targeted, structured toxicity data. This paper presents a novel framework that combines automated data extraction using language models with an uncertainty-informed prediction model to fill data gaps and improve prediction confidence. First, we compare two approaches to automatically extracting relevant data from a corpus of scientific literature on solvents used in perovskite synthesis: smaller bidirectional language models such as BERT and ELMo are used for their repeatability and deterministic outputs, while an autoregressive large language model (LLM) such as GPT-3.5 is used to leverage its larger training corpus and better response generation. Our novel 'prompting and verification' technique, integrated with an LLM, targets extraction and refinement, thereby reducing hallucination and improving the quality of the data extracted by the LLM. Next, the extracted data is fed into our pre-trained multi-task binary classification deep learning model to predict the ED nature of the extracted solvents. We use Shannon entropy-based uncertainty quantification, computed from the class probabilities produced by the classification model, to quantify uncertainty and identify data gaps in our predictions. This approach yields a curated, structured dataset of solvents used in perovskite synthesis together with their uncertainty-informed virtual toxicity assessment. Additionally, chord diagrams are used to visualize solvent interactions and prioritize those with potential hazards, revealing that 70% of the solvent interactions were primarily associated with two specific perovskites.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3111729469</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3111729469</sourcerecordid><originalsourceid>FETCH-proquest_journals_31117294693</originalsourceid><addsrcrecordid>eNqNissKwjAQRYMgWNR_GHBdaJP6WouioCDUrkvQaY3WiWbSQv_eLPwAN_fAPWcgIqlUGq8yKUdiyvxIkkQulnI-V5EwBV3ReW3I9_GBKuteeIP86hDJUA3hgFxXGNY2HZJnKDgUhsDfEfKeAtgw2ArO6GzHT-MROqPhqKludY1wsjdseCKGlW4Ypz-OxWy3vWz28dvZT4vsy4dtHQVVqjRNl3KdLdbqv-oLvtZISw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3111729469</pqid></control><display><type>article</type><title>Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models</title><source>Publicly Available Content Database</source><creator>Mukherjee, Arpan ; Giri, Deepesh ; Rajan, Krishna</creator><creatorcontrib>Mukherjee, Arpan ; Giri, Deepesh ; Rajan, Krishna</creatorcontrib><description>The challenge of accurately predicting toxicity of industrial solvents used in perovskite synthesis is a necessary undertaking but is limited by a lack of a targeted and structured toxicity data. This paper presents a novel framework that combines an automated data extraction using language models, and an uncertainty-informed prediction model to fill data gaps and improve prediction confidence. First, we have utilized and compared two approaches to automatically extract relevant data from a corpus of scientific literature on solvents used in perovskite synthesis: smaller bidirectional language models like BERT and ELMo are used for their repeatability and deterministic outputs, while autoregressive large language model (LLM) such as GPT-3.5 is used to leverage its larger training corpus and better response generation. 
Our novel 'prompting and verification' technique integrated with an LLM aims at targeted extraction and refinement, thereby reducing hallucination and improving the quality of the extracted data using the LLM. Next, the extracted data is fed into our pre-trained multi-task binary classification deep learning to predict the ED nature of extracted solvents. We have used a Shannon entropy-based uncertainty quantification utilizing the class probabilities obtained from the classification model to quantify uncertainty and identify data gaps in our predictions. This approach leads to the curation of a structured dataset for solvents used in perovskite synthesis and their uncertainty-informed virtual toxicity assessment. Additionally, chord diagrams have been used to visualize solvent interactions and prioritize those with potential hazards, revealing that 70% of the solvent interactions were primarily associated with two specific perovskites.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Classification ; Entropy (Information theory) ; Hazard assessment ; Hazard identification ; Large language models ; Machine learning ; Perovskites ; Prediction models ; Predictions ; Solvents ; Synthesis ; Toxic hazards ; Toxicity ; Uncertainty</subject><ispartof>arXiv.org, 2024-09</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/3111729469?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Mukherjee, Arpan</creatorcontrib><creatorcontrib>Giri, Deepesh</creatorcontrib><creatorcontrib>Rajan, Krishna</creatorcontrib><title>Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models</title><title>arXiv.org</title><description>The challenge of accurately predicting toxicity of industrial solvents used in perovskite synthesis is a necessary undertaking but is limited by a lack of a targeted and structured toxicity data. This paper presents a novel framework that combines an automated data extraction using language models, and an uncertainty-informed prediction model to fill data gaps and improve prediction confidence. First, we have utilized and compared two approaches to automatically extract relevant data from a corpus of scientific literature on solvents used in perovskite synthesis: smaller bidirectional language models like BERT and ELMo are used for their repeatability and deterministic outputs, while autoregressive large language model (LLM) such as GPT-3.5 is used to leverage its larger training corpus and better response generation. Our novel 'prompting and verification' technique integrated with an LLM aims at targeted extraction and refinement, thereby reducing hallucination and improving the quality of the extracted data using the LLM. 
Next, the extracted data is fed into our pre-trained multi-task binary classification deep learning to predict the ED nature of extracted solvents. We have used a Shannon entropy-based uncertainty quantification utilizing the class probabilities obtained from the classification model to quantify uncertainty and identify data gaps in our predictions. This approach leads to the curation of a structured dataset for solvents used in perovskite synthesis and their uncertainty-informed virtual toxicity assessment. Additionally, chord diagrams have been used to visualize solvent interactions and prioritize those with potential hazards, revealing that 70% of the solvent interactions were primarily associated with two specific perovskites.</description><subject>Classification</subject><subject>Entropy (Information theory)</subject><subject>Hazard assessment</subject><subject>Hazard identification</subject><subject>Large language models</subject><subject>Machine learning</subject><subject>Perovskites</subject><subject>Prediction models</subject><subject>Predictions</subject><subject>Solvents</subject><subject>Synthesis</subject><subject>Toxic hazards</subject><subject>Toxicity</subject><subject>Uncertainty</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNissKwjAQRYMgWNR_GHBdaJP6WouioCDUrkvQaY3WiWbSQv_eLPwAN_fAPWcgIqlUGq8yKUdiyvxIkkQulnI-V5EwBV3ReW3I9_GBKuteeIP86hDJUA3hgFxXGNY2HZJnKDgUhsDfEfKeAtgw2ArO6GzHT-MROqPhqKludY1wsjdseCKGlW4Ypz-OxWy3vWz28dvZT4vsy4dtHQVVqjRNl3KdLdbqv-oLvtZISw</recordid><startdate>20240930</startdate><enddate>20240930</enddate><creator>Mukherjee, Arpan</creator><creator>Giri, Deepesh</creator><creator>Rajan, Krishna</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240930</creationdate><title>Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models</title><author>Mukherjee, Arpan ; Giri, Deepesh ; Rajan, Krishna</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31117294693</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Classification</topic><topic>Entropy (Information theory)</topic><topic>Hazard assessment</topic><topic>Hazard identification</topic><topic>Large language models</topic><topic>Machine learning</topic><topic>Perovskites</topic><topic>Prediction models</topic><topic>Predictions</topic><topic>Solvents</topic><topic>Synthesis</topic><topic>Toxic hazards</topic><topic>Toxicity</topic><topic>Uncertainty</topic><toplevel>online_resources</toplevel><creatorcontrib>Mukherjee, Arpan</creatorcontrib><creatorcontrib>Giri, Deepesh</creatorcontrib><creatorcontrib>Rajan, Krishna</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium 
Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mukherjee, Arpan</au><au>Giri, Deepesh</au><au>Rajan, Krishna</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models</atitle><jtitle>arXiv.org</jtitle><date>2024-09-30</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>The challenge of accurately predicting toxicity of industrial solvents used in perovskite synthesis is a necessary undertaking but is limited by a lack of a targeted and structured toxicity data. This paper presents a novel framework that combines an automated data extraction using language models, and an uncertainty-informed prediction model to fill data gaps and improve prediction confidence. First, we have utilized and compared two approaches to automatically extract relevant data from a corpus of scientific literature on solvents used in perovskite synthesis: smaller bidirectional language models like BERT and ELMo are used for their repeatability and deterministic outputs, while autoregressive large language model (LLM) such as GPT-3.5 is used to leverage its larger training corpus and better response generation. Our novel 'prompting and verification' technique integrated with an LLM aims at targeted extraction and refinement, thereby reducing hallucination and improving the quality of the extracted data using the LLM. 
Next, the extracted data is fed into our pre-trained multi-task binary classification deep learning to predict the ED nature of extracted solvents. We have used a Shannon entropy-based uncertainty quantification utilizing the class probabilities obtained from the classification model to quantify uncertainty and identify data gaps in our predictions. This approach leads to the curation of a structured dataset for solvents used in perovskite synthesis and their uncertainty-informed virtual toxicity assessment. Additionally, chord diagrams have been used to visualize solvent interactions and prioritize those with potential hazards, revealing that 70% of the solvent interactions were primarily associated with two specific perovskites.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-09
issn 2331-8422
language eng
recordid cdi_proquest_journals_3111729469
source Publicly Available Content Database
subjects Classification
Entropy (Information theory)
Hazard assessment
Hazard identification
Large language models
Machine learning
Perovskites
Prediction models
Predictions
Solvents
Synthesis
Toxic hazards
Toxicity
Uncertainty
title Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T09%3A02%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Uncertainty-Informed%20Screening%20for%20Safer%20Solvents%20Used%20in%20the%20Synthesis%20of%20Perovskite%20via%20Language%20Models&rft.jtitle=arXiv.org&rft.au=Mukherjee,%20Arpan&rft.date=2024-09-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3111729469%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31117294693%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3111729469&rft_id=info:pmid/&rfr_iscdi=true
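
The description above mentions Shannon entropy computed from a classifier's class probabilities as the uncertainty measure. This record does not give the paper's exact formula, threshold, or how multiple tasks are combined, so the following is only a minimal sketch under stated assumptions: each task is binary, entropy is measured in bits, and the function names, the 0.9 cutoff, and the solvent labels and probabilities are hypothetical illustrations, not values from the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a binary prediction with P(positive) = p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        # A fully confident prediction carries no uncertainty.
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def flag_uncertain(probs: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Return the items whose prediction entropy is at or above the threshold.

    The threshold is an assumed, illustrative cutoff, not one from the paper.
    """
    return [name for name, p in probs.items() if binary_entropy(p) >= threshold]


# Hypothetical predicted probabilities for placeholder solvent labels.
example = {"solvent_A": 0.50, "solvent_B": 0.99, "solvent_C": 0.45}
uncertain = flag_uncertain(example)
```

Entropy peaks at 1 bit when the two classes are equally likely and drops to 0 for a fully confident prediction, so high-entropy items are natural candidates for data gaps that need further evidence.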