Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model

Bibliographic Details
Published in: Curēus (Palo Alto, CA), 2024-07, Vol.16 (7), p.e65658
Main Authors: Molena, Kelly F, Macedo, Ana P, Ijaz, Anum, Carvalho, Fabrício K, Gallo, Maria Julia D, Wanderley Garcia de Paula E Silva, Francisco, de Rossi, Andiara, Mezzomo, Luis A, Mugayar, Leda Regina F, Queiroz, Alexandra M
Format: Article
Language: English
Subjects: Accuracy; Artificial intelligence; Chatbots; Citations; Data collection; Dentistry; Healthcare Technology; Knowledge acquisition; Likert scale; Software
Description: Artificial intelligence (AI) can be a tool in the diagnosis and acquisition of knowledge, particularly in dentistry, sparking debates on its application in clinical decision-making. This study aims to evaluate the accuracy, completeness, and reliability of the responses generated by Chatbot Generative Pre-Trained Transformer (ChatGPT) 3.5 in dentistry using expert-formulated questions. Experts were invited to create three questions, answers, and respective references according to specialized fields of activity. The Likert scale was used to evaluate agreement levels between experts and ChatGPT responses. Statistical analysis compared descriptive and binary question groups in terms of accuracy and completeness. Questions with low accuracy underwent re-evaluation, and subsequent responses were compared for improvement. The Wilcoxon test was utilized (α = 0.05). Ten experts across six dental specialties generated 30 binary and descriptive dental questions and references. The accuracy score had a median of 5.50 and a mean of 4.17. For completeness, the median was 2.00 and the mean was 2.07. No difference was observed between descriptive and binary responses for accuracy and completeness. However, re-evaluated responses showed a significant improvement with a significant difference in accuracy (median 5.50 vs. 6.00; mean 4.17 vs. 4.80; p=0.042) and completeness (median 2.0 vs. 2.0; mean 2.07 vs. 2.30; p=0.011). References were more incorrect than correct, with no differences between descriptive and binary questions. ChatGPT initially demonstrated good accuracy and completeness, which was further improved with machine learning (ML) over time. However, some inaccurate answers and references persisted. Human critical discernment continues to be essential to facing complex clinical cases and advancing theoretical knowledge and evidence-based practice.
DOI: 10.7759/cureus.65658
ISSN: 2168-8184
PMID: 39205730
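
The abstract describes a paired statistical comparison: first-pass and re-evaluated ChatGPT ratings scored on Likert scales and compared with the Wilcoxon test at α = 0.05. The sketch below is a minimal illustration of that kind of analysis using SciPy with hypothetical placeholder scores; it does not reproduce the study's data.

```python
# Minimal sketch of the paired comparison described in the abstract:
# a Wilcoxon signed-rank test on accuracy ratings before and after
# re-evaluation, at alpha = 0.05. All score values below are
# hypothetical placeholders, not the study's data.
from scipy.stats import wilcoxon

# Hypothetical paired Likert-style accuracy ratings (1-6) for the same
# ten questions, scored on the first pass and after re-evaluation.
first_pass   = [5, 2, 4, 1, 4, 5, 3, 5, 2, 4]
re_evaluated = [6, 4, 6, 3, 5, 6, 4, 6, 4, 5]

stat, p_value = wilcoxon(first_pass, re_evaluated)  # two-sided by default
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")

alpha = 0.05
if p_value < alpha:
    print("Re-evaluated ratings differ significantly from first-pass ratings.")
else:
    print("No significant difference at alpha = 0.05.")
```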