Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model

Bibliographic Details
Published in: Curēus (Palo Alto, CA), 2024-07, Vol.16 (7), p.e65658
Main Authors: Molena, Kelly F, Macedo, Ana P, Ijaz, Anum, Carvalho, Fabrício K, Gallo, Maria Julia D, Wanderley Garcia de Paula E Silva, Francisco, de Rossi, Andiara, Mezzomo, Luis A, Mugayar, Leda Regina F, Queiroz, Alexandra M
Format: Article
Language: English
Subjects: Accuracy; Artificial intelligence; Chatbots; Citations; Data collection; Dentistry; Healthcare Technology; Knowledge acquisition; Likert scale; Software
Description: Artificial intelligence (AI) can be a tool in the diagnosis and acquisition of knowledge, particularly in dentistry, sparking debates on its application in clinical decision-making. This study aims to evaluate the accuracy, completeness, and reliability of the responses generated by Chatbot Generative Pre-Trained Transformer (ChatGPT) 3.5 in dentistry using expert-formulated questions. Experts were invited to create three questions, answers, and respective references according to specialized fields of activity. The Likert scale was used to evaluate agreement levels between experts and ChatGPT responses. Statistical analysis compared descriptive and binary question groups in terms of accuracy and completeness. Questions with low accuracy underwent re-evaluation, and subsequent responses were compared for improvement. The Wilcoxon test was utilized (α = 0.05). Ten experts across six dental specialties generated 30 binary and descriptive dental questions and references. The accuracy score had a median of 5.50 and a mean of 4.17. For completeness, the median was 2.00 and the mean was 2.07. No difference was observed between descriptive and binary responses for accuracy and completeness. However, re-evaluated responses showed a significant improvement with a significant difference in accuracy (median 5.50 vs. 6.00; mean 4.17 vs. 4.80; p=0.042) and completeness (median 2.0 vs. 2.0; mean 2.07 vs. 2.30; p=0.011). References were more incorrect than correct, with no differences between descriptive and binary questions. ChatGPT initially demonstrated good accuracy and completeness, which was further improved with machine learning (ML) over time. However, some inaccurate answers and references persisted. Human critical discernment continues to be essential to facing complex clinical cases and advancing theoretical knowledge and evidence-based practice.
DOI: 10.7759/cureus.65658
ISSN: 2168-8184
PMID: 39205730
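
The abstract describes a paired statistical comparison: first-pass and re-evaluated ChatGPT ratings scored on Likert scales and compared with the Wilcoxon test at α = 0.05. The sketch below is a minimal illustration of that kind of analysis using SciPy with hypothetical placeholder scores; it does not reproduce the study's data.

```python
# Minimal sketch of the paired comparison described in the abstract:
# a Wilcoxon signed-rank test on accuracy ratings before and after
# re-evaluation, at alpha = 0.05. All score values below are
# hypothetical placeholders, not the study's data.
from scipy.stats import wilcoxon

# Hypothetical paired Likert-style accuracy ratings (1-6) for the same
# ten questions, scored on the first pass and after re-evaluation.
first_pass   = [5, 2, 4, 1, 4, 5, 3, 5, 2, 4]
re_evaluated = [6, 4, 6, 3, 5, 6, 4, 6, 4, 5]

stat, p_value = wilcoxon(first_pass, re_evaluated)  # two-sided by default
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")

alpha = 0.05
if p_value < alpha:
    print("Re-evaluated ratings differ significantly from first-pass ratings.")
else:
    print("No significant difference at alpha = 0.05.")
```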