Comparison of the Performance of Artificial Intelligence Versus Medical Professionals in the Polish Final Medical Examination
Published in: | Curēus (Palo Alto, CA), 2024-08, Vol.16 (8), p.e66011 |
---|---|
Main Authors: | Jaworski, Aleksander; Jasiński, Dawid; Jaworski, Wojciech; Hop, Aleksandra; Janek, Artur; Sławińska, Barbara; Konieczniak, Lena; Rzepka, Maciej; Jung, Maximilian; Sysło, Oliwia; Jarząbek, Victoria; Błecha, Zuzanna; Haraziński, Konrad; Jasińska, Natalia |
Format: | Article |
Language: | English |
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c300t-39f04b5ca8b08fa6956f0dd3a5aa63a6347fc03175ee612d98210f66035274ff3 |
container_end_page | |
container_issue | 8 |
container_start_page | e66011 |
container_title | Curēus (Palo Alto, CA) |
container_volume | 16 |
creator | Jaworski, Aleksander; Jasiński, Dawid; Jaworski, Wojciech; Hop, Aleksandra; Janek, Artur; Sławińska, Barbara; Konieczniak, Lena; Rzepka, Maciej; Jung, Maximilian; Sysło, Oliwia; Jarząbek, Victoria; Błecha, Zuzanna; Haraziński, Konrad; Jasińska, Natalia |
description | The rapid development of artificial intelligence (AI) technologies like OpenAI's Generative Pretrained Transformer (GPT), particularly ChatGPT, has shown promising applications in various fields, including medicine. This study evaluates ChatGPT's performance on the Polish Final Medical Examination (LEK), comparing its efficacy to that of human test-takers.
The study analyzed ChatGPT's ability to answer 196 multiple-choice questions from the spring 2021 LEK. Questions were categorized into "clinical cases" and "other" general medical knowledge, and then divided according to medical fields. Two versions of ChatGPT (3.5 and 4.0) were tested. Statistical analyses, including Pearson's χ² test and the Mann-Whitney U test, were conducted to compare the AI's performance and confidence levels.
ChatGPT 3.5 correctly answered 50.51% of the questions, while ChatGPT 4.0 answered 77.55% correctly, surpassing the 56% passing threshold. Version 3.5 showed significantly higher confidence in correct answers, whereas version 4.0 maintained consistent confidence regardless of answer accuracy. No significant differences in performance were observed across different medical fields.
ChatGPT 4.0 demonstrated the ability to pass the LEK, indicating substantial potential for AI in medical education and assessment. Future improvements in AI models, such as the anticipated ChatGPT 5.0, may further enhance performance, potentially equaling or surpassing human test-takers. |
doi_str_mv | 10.7759/cureus.66011 |
format | article |
fullrecord | PMID: 39221376; Publisher: Cureus Inc (United States); Rights: Copyright © 2024, Jaworski et al., published under the Creative Commons Attribution (CC BY 4.0) license |
fulltext | fulltext |
identifier | ISSN: 2168-8184 |
ispartof | Curēus (Palo Alto, CA), 2024-08, Vol.16 (8), p.e66011 |
issn | 2168-8184 (ISSN); 2168-8184 (EISSN) |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11366403 |
source | Publicly Available Content Database (Proquest) (PQ_SDU_P3); PubMed Central |
subjects | Artificial intelligence Bioethics Certification Chatbots Confidence Deep learning Emergency medical care Gynecology Healthcare Technology Intensive care Internal medicine Machine learning Mann-Whitney U test Medical Education Medical research Medical Simulation Medicine Multiple choice Natural language Obstetrics Pediatrics Psychiatry Public health Social networks Statistical analysis Surgery |
title | Comparison of the Performance of Artificial Intelligence Versus Medical Professionals in the Polish Final Medical Examination |
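The abstract above reports the statistical comparison only at a high level. Below is a minimal illustrative sketch, not taken from the paper, of how such a comparison could be run in Python with SciPy: the correct-answer counts are back-calculated from the reported percentages (50.51% and 77.55% of 196 questions), while the per-question confidence scores are purely hypothetical placeholders standing in for the values the authors collected.

```python
# Illustrative sketch (not from the paper): compare two models' accuracy on the
# same 196-question exam with Pearson's chi-squared test, and compare
# self-reported confidence for correct vs. incorrect answers with the
# Mann-Whitney U test. Counts are derived from the reported percentages;
# the confidence arrays below are placeholder data.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

N_QUESTIONS = 196
correct_35 = round(0.5051 * N_QUESTIONS)   # 99 correct answers for ChatGPT 3.5
correct_40 = round(0.7755 * N_QUESTIONS)   # 152 correct answers for ChatGPT 4.0

# 2x2 contingency table: rows = model version, columns = correct / incorrect
table = np.array([
    [correct_35, N_QUESTIONS - correct_35],
    [correct_40, N_QUESTIONS - correct_40],
])
chi2, p_accuracy, dof, _ = chi2_contingency(table)
print(f"Accuracy difference: chi2={chi2:.2f}, dof={dof}, p={p_accuracy:.4g}")

# Placeholder confidence scores (e.g., 1-5 self-rated certainty per answer);
# in the study these would be the model's stated confidence per question.
rng = np.random.default_rng(0)
conf_correct = rng.integers(3, 6, size=correct_35)                 # hypothetical
conf_incorrect = rng.integers(1, 5, size=N_QUESTIONS - correct_35)  # hypothetical
u_stat, p_conf = mannwhitneyu(conf_correct, conf_incorrect, alternative="two-sided")
print(f"Confidence (correct vs. incorrect): U={u_stat:.1f}, p={p_conf:.4g}")
```

With the back-calculated counts (99 vs. 152 correct of 196), the chi-squared test yields roughly χ² ≈ 30 with Yates' continuity correction and p far below 0.001, i.e., a clearly significant accuracy gap between the two versions; the confidence comparison, by contrast, depends entirely on the placeholder data and is shown only to illustrate the test call.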