Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study
Published in: | Journal of Medical Internet Research, 2024-01-23, Vol. 26 (4), p. e52113 |
---|---|
Main Authors: | Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike; Herschbach, Lea; Griewatz, Jan; Masters, Ken; Zipfel, Stephan; Mahling, Moritz |
Format: | Article |
Language: | English |
Subjects: | Answers; Anxiety disorders; Application programming interface; Blooms taxonomy; Chatbots; Classification; Cognition & reasoning; Cognitive ability; Data analysis; Education, Medical; Educational objectives; Hallucinations; Health care reform; Heart attacks; Humans; Language; Learning; Medical education; Medical schools; Medical students; Medicine; Methods; Multiple choice; Original Paper; Post traumatic stress disorder; Psychosomatic Medicine; Psychotherapy; Qualitative research; Research Design; Tests |
DOI: | 10.2196/52113 |
ISSN: | 1439-4456; 1438-8871 (electronic) |
PMID: | 38261378 |
Online Access: | Get full text |
Abstract:

Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are increasingly being used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.

This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.

We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.

GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
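The methods described above (GPT-4 answering 307 multiple-choice questions under a detailed and a short prompt, then tallying success rates) can be illustrated with a minimal sketch. This is not the authors' code: the prompt wording, the MCQ data structure, the model identifier, and the answer parsing are assumptions for illustration, and the only external dependency assumed is the OpenAI Python SDK (v1.x) with an API key in the environment.

```python
# Minimal sketch: query GPT-4 with multiple-choice questions under two prompt
# styles and compute the success rate. Prompt texts, question format, and the
# "gpt-4" model identifier are illustrative assumptions, not the study's setup.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-ins for the study's "detailed" and "short" prompt versions.
PROMPTS = {
    "detailed": (
        "You are a medical student taking a psychosomatic medicine exam. "
        "Read the question and all answer options carefully, then reply with "
        "the single letter of the best answer and nothing else."
    ),
    "short": "Answer the multiple-choice question with a single letter.",
}

@dataclass
class MCQ:
    stem: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "..."}
    correct: str             # key of the correct option

def ask_gpt4(question: MCQ, prompt_version: str) -> str:
    """Send one question to GPT-4 and return the letter it picks."""
    options_text = "\n".join(f"{key}) {text}" for key, text in question.options.items())
    response = client.chat.completions.create(
        model="gpt-4",      # assumed model identifier
        temperature=0,      # keep grading roughly reproducible
        messages=[
            {"role": "system", "content": PROMPTS[prompt_version]},
            {"role": "user", "content": f"{question.stem}\n{options_text}"},
        ],
    )
    reply = response.choices[0].message.content.strip()
    return reply[:1].upper()  # naive parsing: treat the first character as the chosen letter

def success_rate(questions: list[MCQ], prompt_version: str) -> float:
    """Fraction answered correctly, e.g. 284/307 for approximately 0.93."""
    correct = sum(ask_gpt4(q, prompt_version) == q.correct for q in questions)
    return correct / len(questions)
```

The study's actual prompts, the annotation of errors against Bloom's taxonomy, and the difficulty statistics (eg, the P=.002 and P<.001 comparisons) are not reproduced in this sketch.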