
Comparison of the Performance of Artificial Intelligence Versus Medical Professionals in the Polish Final Medical Examination

The rapid development of artificial intelligence (AI) technologies like OpenAI's Generative Pretrained Transformer (GPT), particularly ChatGPT, has shown promising applications in various fields, including medicine. This study evaluates ChatGPT's performance on the Polish Final Medical Examination (LEK), comparing its efficacy to that of human test-takers. The study analyzed ChatGPT's ability to answer 196 multiple-choice questions from the spring 2021 LEK. Questions were categorized into "clinical cases" and "other" general medical knowledge, and then divided according to medical fields. Two versions of ChatGPT (3.5 and 4.0) were tested. Statistical analyses, including Pearson's χ² test and the Mann-Whitney U test, were conducted to compare the AI's performance and confidence levels. ChatGPT 3.5 correctly answered 50.51% of the questions, while ChatGPT 4.0 answered 77.55% correctly, surpassing the 56% passing threshold. Version 3.5 showed significantly higher confidence in correct answers, whereas version 4.0 maintained consistent confidence regardless of answer accuracy. No significant differences in performance were observed across different medical fields. ChatGPT 4.0 demonstrated the ability to pass the LEK, indicating substantial potential for AI in medical education and assessment. Future improvements in AI models, such as the anticipated ChatGPT 5.0, may further enhance performance, potentially equaling or surpassing human test-takers.
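
The abstract names Pearson's χ² test (for answer accuracy) and the Mann-Whitney U test (for confidence ratings). The short Python sketch below illustrates that kind of comparison; it is not the authors' code: the correct-answer counts are back-calculated from the reported percentages (50.51% and 77.55% of 196 questions), and the confidence values are hypothetical placeholders.

```python
# Illustrative sketch of the statistical comparisons described in the abstract.
# Not the authors' code: correct-answer counts are derived from the reported
# percentages; the confidence ratings are hypothetical placeholder values.
from scipy.stats import chi2_contingency, mannwhitneyu

TOTAL_QUESTIONS = 196
correct_35, correct_40 = 99, 152  # 50.51% and 77.55% of 196 questions

# 2x2 contingency table: rows = ChatGPT version, columns = correct / incorrect
table = [
    [correct_35, TOTAL_QUESTIONS - correct_35],
    [correct_40, TOTAL_QUESTIONS - correct_40],
]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")

# Mann-Whitney U test comparing confidence for correct vs. incorrect answers
# (hypothetical example ratings on a 1-5 scale).
confidence_correct = [5, 4, 5, 3, 4, 5, 4]
confidence_incorrect = [3, 2, 4, 3, 2, 3]
u_stat, p_conf = mannwhitneyu(confidence_correct, confidence_incorrect,
                              alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_conf:.4f}")
```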

Bibliographic Details
Published in: Curēus (Palo Alto, CA), 2024-08, Vol.16 (8), p.e66011
Main Authors: Jaworski, Aleksander, Jasiński, Dawid, Jaworski, Wojciech, Hop, Aleksandra, Janek, Artur, Sławińska, Barbara, Konieczniak, Lena, Rzepka, Maciej, Jung, Maximilian, Sysło, Oliwia, Jarząbek, Victoria, Błecha, Zuzanna, Haraziński, Konrad, Jasińska, Natalia
Format: Article
Language: English
Subjects: Artificial intelligence; Bioethics; Certification; Chatbots; Confidence; Deep learning; Emergency medical care; Gynecology; Healthcare Technology; Intensive care; Internal medicine; Machine learning; Mann-Whitney U test; Medical Education; Medical research; Medical Simulation; Medicine; Multiple choice; Natural language; Obstetrics; Pediatrics; Psychiatry; Public health; Social networks; Statistical analysis; Surgery
ISSN: 2168-8184
DOI: 10.7759/cureus.66011
PMID: 39221376
Publisher: Cureus Inc, United States
Online Access: https://doi.org/10.7759/cureus.66011