Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model
Published in: | Curēus (Palo Alto, CA), 2024-07, Vol.16 (7), p.e65658
Main Authors: | Molena, Kelly F; Macedo, Ana P; Ijaz, Anum; Carvalho, Fabrício K; Gallo, Maria Julia D; Wanderley Garcia de Paula E Silva, Francisco; de Rossi, Andiara; Mezzomo, Luis A; Mugayar, Leda Regina F; Queiroz, Alexandra M
Format: | Article
Language: | English
Subjects: | Accuracy; Artificial intelligence; Chatbots; Citations; Data collection; Dentistry; Healthcare Technology; Knowledge acquisition; Likert scale; Software
cited_by | |
cites | cdi_FETCH-LOGICAL-c300t-a1cad702de7cce54a8ca2f5489fd74fad84bdbbc86e629a559844b228a9a42733 |
container_end_page | |
container_issue | 7 |
container_start_page | e65658 |
container_title | Curēus (Palo Alto, CA) |
container_volume | 16 |
creator | Molena, Kelly F; Macedo, Ana P; Ijaz, Anum; Carvalho, Fabrício K; Gallo, Maria Julia D; Wanderley Garcia de Paula E Silva, Francisco; de Rossi, Andiara; Mezzomo, Luis A; Mugayar, Leda Regina F; Queiroz, Alexandra M
description | Artificial intelligence (AI) can serve as a tool for diagnosis and knowledge acquisition, particularly in dentistry, sparking debate about its application in clinical decision-making.
This study aims to evaluate the accuracy, completeness, and reliability of the responses generated by Chatbot Generative Pre-Trained Transformer (ChatGPT) 3.5 in dentistry using expert-formulated questions.
Experts were invited to create three questions, answers, and respective references according to specialized fields of activity. The Likert scale was used to evaluate agreement levels between experts and ChatGPT responses. Statistical analysis compared descriptive and binary question groups in terms of accuracy and completeness. Questions with low accuracy underwent re-evaluation, and subsequent responses were compared for improvement. The Wilcoxon test was utilized (α = 0.05).
Ten experts across six dental specialties generated 30 binary and descriptive dental questions with references. Accuracy scores had a median of 5.50 and a mean of 4.17; completeness scores had a median of 2.00 and a mean of 2.07. No difference was observed between descriptive and binary responses in either accuracy or completeness. However, re-evaluated responses showed a significant improvement in accuracy (median 5.50 vs. 6.00; mean 4.17 vs. 4.80; p=0.042) and completeness (median 2.0 vs. 2.0; mean 2.07 vs. 2.30; p=0.011). References were more often incorrect than correct, with no difference between descriptive and binary questions.
ChatGPT initially demonstrated good accuracy and completeness, which improved further with machine learning (ML) over time. However, some inaccurate answers and references persisted. Human critical discernment remains essential for handling complex clinical cases and for advancing theoretical knowledge and evidence-based practice. |
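The re-evaluation comparison reported in the abstract (initial vs. re-evaluated scores, Wilcoxon test at α = 0.05) can be illustrated with a minimal sketch. The library call below is standard SciPy, but the paired scores and the 1-6 rating scale are hypothetical placeholders, not data from the study.

```python
# Minimal sketch of the paired comparison described in the abstract:
# Wilcoxon signed-rank test on initial vs. re-evaluated accuracy scores,
# two-sided, alpha = 0.05. All values are hypothetical placeholders.
from scipy.stats import wilcoxon

initial_accuracy = [5, 4, 3, 5, 2, 4, 3, 4, 5, 4]      # hypothetical first-pass ratings (1-6 scale)
reevaluated_accuracy = [6, 5, 5, 6, 4, 5, 4, 6, 6, 6]  # hypothetical ratings after re-evaluation

statistic, p_value = wilcoxon(initial_accuracy, reevaluated_accuracy)
print(f"W = {statistic}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Re-evaluated responses differ significantly from the initial ones.")
```

The study itself reports p=0.042 for accuracy and p=0.011 for completeness; the sketch only mirrors the shape of that analysis, not its data.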
doi_str_mv | 10.7759/cureus.65658 |
format | article |
fulltext | fulltext |
identifier | ISSN: 2168-8184 |
ispartof | Curēus (Palo Alto, CA), 2024-07, Vol.16 (7), p.e65658 |
issn | 2168-8184 2168-8184 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11352766 |
source | Publicly Available Content Database; PubMed Central |
subjects | Accuracy; Artificial intelligence; Chatbots; Citations; Data collection; Dentistry; Healthcare Technology; Knowledge acquisition; Likert scale; Software
title | Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model |