
MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering

Bibliographic Details
Published in: Artificial intelligence in medicine, 2024-09, Vol. 155, p. 102938, Article 102938
Main Authors: Alonso, Iñigo; Oronoz, Maite; Agerri, Rodrigo
Format: Article
Language: English
Subjects: Large Language Models; Medical Question Answering; Multilinguality; Natural Language Processing; Retrieval Augmented Generation
DOI: 10.1016/j.artmed.2024.102938
ISSN: 0933-3657
EISSN: 1873-2860
PMID: 39121544
Publisher: Elsevier B.V.

Description: Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology to assist medical experts in interactive decision support. This potential has been illustrated by the state-of-the-art performance obtained by LLMs in Medical Question Answering, with striking results such as passing marks in licensing medical exams. However, while impressive, the quality bar required for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLM predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes, for the first time, reference gold explanations written by medical doctors for the correct and incorrect options in the exams. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs, with the best results around 75% accuracy for English, still has large room for improvement, especially for languages other than English, for which accuracy drops by 10 points. Therefore, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact downstream evaluation results for Medical Question Answering. Data, code, and fine-tuned models will be made publicly available at https://huggingface.co/datasets/HiTZ/MedExpQA.

Highlights:
• MedExpQA: the first multilingual benchmark for Medical QA including gold reference explanations.
• Comparison of gold and automatically extracted medical knowledge via RAG techniques.
• Fine-tuning makes the external knowledge obtained via RAG redundant.
• Overall performance of LLMs with or without RAG still has large room for improvement.
• Performance for French, Italian and Spanish is lower for every LLM in every setting.
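
The abstract reports multiple-choice accuracy per language on the released dataset. The loop below is a minimal sketch (not the authors' published evaluation code) of how such a score could be computed with the Hugging Face datasets library; the configuration name "en", the split name "test", and the field names "question", "options" and "correct_option" are hypothetical placeholders rather than the dataset's documented schema, and predict() stands in for an actual LLM call.

    from datasets import load_dataset

    def predict(question, options):
        # Placeholder for an LLM call; returns the index of the chosen option.
        return 0

    def evaluate(language="en", split="test"):
        # Config and split names are assumed; check the dataset card for the real ones.
        dataset = load_dataset("HiTZ/MedExpQA", language, split=split)
        correct = 0
        for example in dataset:
            # "question", "options" and "correct_option" are hypothetical field names.
            if predict(example["question"], example["options"]) == example["correct_option"]:
                correct += 1
        return correct / len(dataset)

    print(f"Accuracy: {evaluate():.3f}")

Swapping in a real model call for predict() and iterating over the per-language configurations would reproduce the kind of per-language accuracy comparison the abstract describes.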