
Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
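For readers who want a concrete picture of the quantitative step summarized above, the sketch below shows one way to score a model's multiple-choice answers per prompt version and tally incorrect responses by Bloom's taxonomy level. It is not the authors' code: the `Item` structure, the `score_run` function, and the toy data are hypothetical illustrations of the kind of analysis the abstract describes (accuracy per prompt version, error counts per cognitive level).

```python
# Minimal sketch (assumed, not the study's actual pipeline): score MCQ answers
# and count errors by Bloom level for one prompt version.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    question_id: int
    correct_option: str  # e.g. "B"
    bloom_level: str     # e.g. "remember", "understand", "apply"


def score_run(items, model_answers):
    """Return (accuracy, Counter of Bloom levels for incorrectly answered items)."""
    errors = Counter()
    n_correct = 0
    for item in items:
        if model_answers.get(item.question_id) == item.correct_option:
            n_correct += 1
        else:
            errors[item.bloom_level] += 1
    return n_correct / len(items), errors


if __name__ == "__main__":
    # Hypothetical toy data; the study used 307 real exam questions and two
    # prompt versions (detailed and short), reporting 284/307 and 278/307 correct.
    items = [Item(1, "A", "remember"), Item(2, "C", "understand"), Item(3, "B", "apply")]
    answers_detailed = {1: "A", 2: "B", 3: "B"}  # model's picks under the detailed prompt
    accuracy, errors = score_run(items, answers_detailed)
    print(f"accuracy={accuracy:.0%}, errors by Bloom level={dict(errors)}")
```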

Bibliographic Details
Published in: Journal of medical Internet research, 2024-01, Vol. 26 (4), p. e52113
Main Authors: Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike; Herschbach, Lea; Griewatz, Jan; Masters, Ken; Zipfel, Stephan; Mahling, Moritz
Format: Article
Language: English
Subjects: Answers; Anxiety disorders; Application programming interface; Bloom's taxonomy; Chatbots; Classification; Cognition & reasoning; Cognitive ability; Data analysis; Education, Medical; Educational objectives; Hallucinations; Health care reform; Heart attacks; Humans; Language; Learning; Medical education; Medical schools; Medical students; Medicine; Methods; Multiple choice; Original Paper; Post traumatic stress disorder; Psychosomatic Medicine; Psychotherapy; Qualitative research; Research Design; Tests
Publisher: Journal of Medical Internet Research (Canada)
DOI: 10.2196/52113
ISSN: 1439-4456, 1438-8871
EISSN: 1438-8871
PMID: 38261378
Source: Applied Social Sciences Index & Abstracts (ASSIA); PubMed (Medline); Library & Information Science Abstracts (LISA); Publicly Available Content Database; Social Science Premium Collection; Library & Information Science Collection