MediAlbertina: An European Portuguese medical language model

Bibliographic Details
Published in: Computers in biology and medicine 2024-11, Vol.182, p.109233, Article 109233
Main Authors: Nunes, Miguel; Boné, João; Ferreira, João C.; Chaves, Pedro; Elvas, Luis B.
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573
container_end_page
container_issue
container_start_page 109233
container_title Computers in biology and medicine
container_volume 182
creator Nunes, Miguel
Boné, João
Ferreira, João C.
Chaves, Pedro
Elvas, Luis B.
description Patient medical information often exists as unstructured text containing abbreviations and acronyms, which save time and space but pose challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by a language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model. After a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training using more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on this data using masked language modelling. It was compared with the baseline using both perplexity, which decreased from about 20 to 1.6, and the fine-tuning and evaluation of information extraction models for Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4–6% on recall and F1-score. This study contributes the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and offers a contribution to the scientific community in overcoming the obstacles faced by non-English languages. By fine-tuning MediAlbertina for PT-PT medical tasks, further steps can be taken to assist physicians, such as creating decision support systems or building medical timelines for patient profiling.
Problem: There is a vast amount of unstructured medical text, yet there is no publicly available medical LM trained with PT-PT data.
What is Already Known: Domain adaptation is a process that achieves better results in NLP tasks than general models. Several studies have performed domain adaptation, and there are concentrated efforts to apply this technique to non-English languages, where data availability and literature are scarce.
What This Paper Adds: This study presents the first publicly available medical LM trained with PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of being a non-English language and providing motivation for other non-English languages to perform similar tasks. With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI methods to assist physicians in reducing their workload and, indirectly, improving patients' outcomes.
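The abstract reports perplexity as one of its two comparison measures. As a reading aid, the following is a minimal sketch of how perplexity is commonly defined: the exponential of the mean negative log-likelihood of the correct tokens. This record does not describe the authors' exact evaluation procedure, so the function name `perplexity` and the probabilities below are illustrative assumptions, chosen only to show what values of roughly 20 and 1.6 mean; they are not data from the paper.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the scored tokens.

    token_log_probs: natural-log probabilities the model assigned to the
    correct token at each scored (e.g. masked) position.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values, not taken from the paper: a model that gives each
# correct token probability 0.05 versus one that gives it 0.625.
uncertain = [math.log(0.05)] * 4
confident = [math.log(0.625)] * 4

print(round(perplexity(uncertain), 3))
print(round(perplexity(confident), 3))
```

Read this way, a perplexity near 20 corresponds to a model that is, on average, about as uncertain as a uniform choice among 20 tokens, while a value near 1.6 corresponds to a choice among fewer than two.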
doi_str_mv 10.1016/j.compbiomed.2024.109233
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3112856769</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0010482524013180</els_id><sourcerecordid>3112856769</sourcerecordid><originalsourceid>FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573</originalsourceid><addsrcrecordid>eNqFkMtKAzEUhoMoWi-vIANu3EzNbTKJuKlSL1DRha5DmjkpKdNJTWYE396UVgQ3rgIn3zn_z4dQQfCYYCKulmMbVuu5DytoxhRTnseKMraHRkTWqsQV4_tohDHBJZe0OkLHKS0xxhwzfIiOmGKCYkxH6OYZGj9p5xB735nrYtIV0yGGNZiueA2xHxYDJChyjremLVrTLQazyIPQQHuKDpxpE5zt3hP0fj99u3ssZy8PT3eTWWkpp6xkjnGqJGWOi0Zw6ghIoUjuDcIp6SruGKWVASZJI0E5x5Sdg1DScFdXNTtBl9u76xg-cp9er3yy0OY2EIakGSFUVqIWKqMXf9BlGGKX22WKVkrVRG0Oyi1lY0gpgtPr6FcmfmmC9cawXupfw3pjWG8N59XzXcAw3_z9LP4ozcDtFoBs5NND1Ml66Gw2GMH2ugn-_5RvCIaO4g</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3125997197</pqid></control><display><type>article</type><title>MediAlbertina: An European Portuguese medical language model</title><source>ScienceDirect Journals</source><creator>Nunes, Miguel ; Boné, João ; Ferreira, João C. ; Chaves, Pedro ; Elvas, Luis B.</creator><creatorcontrib>Nunes, Miguel ; Boné, João ; Ferreira, João C. ; Chaves, Pedro ; Elvas, Luis B.</creatorcontrib><description>Patient medical information often exists in unstructured text containing abbreviations and acronyms deemed essential to conserve time and space but posing challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by a language model and continue its pre-training to develop an European Portuguese (PT-PT) healthcare-domain language model. 
After carrying out a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training using more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M has been created through domain adaptation on this data using masked language modelling. The comparison with our baseline was made through the usage of both perplexity, which decreased from about 20 to 1.6 values, and the fine-tuning and evaluation of information extraction models such as Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4–6% on recall and f1-score. This study contributes with the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and offers a contribution to the scientific community in overcoming obstacles of non-English languages. With MediAlbertina, further steps can be taken to assist physicians, in creating decision support systems or building medical timelines in order to perform profiling, by fine-tuning MediAlbertina for PT- PT medical tasks. Problem: There is a vast amount of medical unstructured text, yet there is no publicly available medical LM trained with PT-PT data. What is Already Known: Domain adaptation is a process that allows achieving better results in NLP tasks compared to general models. Several studies have performed domain adaptation, and there are concentrated efforts in utilizing this technique for non-English languages where data availability and literature are scarce. What This Paper Adds: This study presents the first publicly available medical LM trained with PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of being a non-English language and providing motivation for other non-English languages to perform similar tasks. 
With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI methods to assist physicians in reducing their workload and, indirectly, improving patients' outcomes.</description><identifier>ISSN: 0010-4825</identifier><identifier>ISSN: 1879-0534</identifier><identifier>EISSN: 1879-0534</identifier><identifier>DOI: 10.1016/j.compbiomed.2024.109233</identifier><identifier>PMID: 39362002</identifier><language>eng</language><publisher>United States: Elsevier Ltd</publisher><subject>Abbreviations ; Adaptation ; Alzheimer's disease ; Artificial intelligence ; Cancer ; Decision support systems ; Domain adaptation ; Domain specific languages ; Effectiveness ; Electronic Health Records ; Electronic medical records ; English language ; European Portuguese ; Health care industry ; Hospitals ; Humans ; Information extraction ; Information processing ; Information retrieval ; Language ; Masked language modelling ; Medical language model ; Natural Language Processing ; Non-English languages ; Patients ; Performance evaluation ; Physicians ; Portugal ; Portuguese language ; Unstructured data</subject><ispartof>Computers in biology and medicine, 2024-11, Vol.182, p.109233, Article 109233</ispartof><rights>2024 The Authors</rights><rights>Copyright © 2024 The Authors. Published by Elsevier Ltd.. All rights reserved.</rights><rights>2024. 
The Authors</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573</cites><orcidid>0000-0003-4201-5641 ; 0000-0003-4909-6341 ; 0000-0002-7489-4380</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,777,781,27905,27906</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/39362002$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Nunes, Miguel</creatorcontrib><creatorcontrib>Boné, João</creatorcontrib><creatorcontrib>Ferreira, João C.</creatorcontrib><creatorcontrib>Chaves, Pedro</creatorcontrib><creatorcontrib>Elvas, Luis B.</creatorcontrib><title>MediAlbertina: An European Portuguese medical language model</title><title>Computers in biology and medicine</title><addtitle>Comput Biol Med</addtitle><description>Patient medical information often exists in unstructured text containing abbreviations and acronyms deemed essential to conserve time and space but posing challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by a language model and continue its pre-training to develop an European Portuguese (PT-PT) healthcare-domain language model. After carrying out a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training using more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M has been created through domain adaptation on this data using masked language modelling. 
The comparison with our baseline was made through the usage of both perplexity, which decreased from about 20 to 1.6 values, and the fine-tuning and evaluation of information extraction models such as Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4–6% on recall and f1-score. This study contributes with the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and offers a contribution to the scientific community in overcoming obstacles of non-English languages. With MediAlbertina, further steps can be taken to assist physicians, in creating decision support systems or building medical timelines in order to perform profiling, by fine-tuning MediAlbertina for PT- PT medical tasks. Problem: There is a vast amount of medical unstructured text, yet there is no publicly available medical LM trained with PT-PT data. What is Already Known: Domain adaptation is a process that allows achieving better results in NLP tasks compared to general models. Several studies have performed domain adaptation, and there are concentrated efforts in utilizing this technique for non-English languages where data availability and literature are scarce. What This Paper Adds: This study presents the first publicly available medical LM trained with PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of being a non-English language and providing motivation for other non-English languages to perform similar tasks. 
With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI methods to assist physicians in reducing their workload and, indirectly, improving patients' outcomes.</description><subject>Abbreviations</subject><subject>Adaptation</subject><subject>Alzheimer's disease</subject><subject>Artificial intelligence</subject><subject>Cancer</subject><subject>Decision support systems</subject><subject>Domain adaptation</subject><subject>Domain specific languages</subject><subject>Effectiveness</subject><subject>Electronic Health Records</subject><subject>Electronic medical records</subject><subject>English language</subject><subject>European Portuguese</subject><subject>Health care industry</subject><subject>Hospitals</subject><subject>Humans</subject><subject>Information extraction</subject><subject>Information processing</subject><subject>Information retrieval</subject><subject>Language</subject><subject>Masked language modelling</subject><subject>Medical language model</subject><subject>Natural Language Processing</subject><subject>Non-English languages</subject><subject>Patients</subject><subject>Performance evaluation</subject><subject>Physicians</subject><subject>Portugal</subject><subject>Portuguese language</subject><subject>Unstructured 
data</subject><issn>0010-4825</issn><issn>1879-0534</issn><issn>1879-0534</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNqFkMtKAzEUhoMoWi-vIANu3EzNbTKJuKlSL1DRha5DmjkpKdNJTWYE396UVgQ3rgIn3zn_z4dQQfCYYCKulmMbVuu5DytoxhRTnseKMraHRkTWqsQV4_tohDHBJZe0OkLHKS0xxhwzfIiOmGKCYkxH6OYZGj9p5xB735nrYtIV0yGGNZiueA2xHxYDJChyjremLVrTLQazyIPQQHuKDpxpE5zt3hP0fj99u3ssZy8PT3eTWWkpp6xkjnGqJGWOi0Zw6ghIoUjuDcIp6SruGKWVASZJI0E5x5Sdg1DScFdXNTtBl9u76xg-cp9er3yy0OY2EIakGSFUVqIWKqMXf9BlGGKX22WKVkrVRG0Oyi1lY0gpgtPr6FcmfmmC9cawXupfw3pjWG8N59XzXcAw3_z9LP4ozcDtFoBs5NND1Ml66Gw2GMH2ugn-_5RvCIaO4g</recordid><startdate>202411</startdate><enddate>202411</enddate><creator>Nunes, Miguel</creator><creator>Boné, João</creator><creator>Ferreira, João C.</creator><creator>Chaves, Pedro</creator><creator>Elvas, Luis B.</creator><general>Elsevier Ltd</general><general>Elsevier Limited</general><scope>6I.</scope><scope>AAFTH</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>8FD</scope><scope>FR3</scope><scope>JQ2</scope><scope>K9.</scope><scope>M7Z</scope><scope>NAPCQ</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0003-4201-5641</orcidid><orcidid>https://orcid.org/0000-0003-4909-6341</orcidid><orcidid>https://orcid.org/0000-0002-7489-4380</orcidid></search><sort><creationdate>202411</creationdate><title>MediAlbertina: An European Portuguese medical language model</title><author>Nunes, Miguel ; Boné, João ; Ferreira, João C. 
; Chaves, Pedro ; Elvas, Luis B.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Abbreviations</topic><topic>Adaptation</topic><topic>Alzheimer's disease</topic><topic>Artificial intelligence</topic><topic>Cancer</topic><topic>Decision support systems</topic><topic>Domain adaptation</topic><topic>Domain specific languages</topic><topic>Effectiveness</topic><topic>Electronic Health Records</topic><topic>Electronic medical records</topic><topic>English language</topic><topic>European Portuguese</topic><topic>Health care industry</topic><topic>Hospitals</topic><topic>Humans</topic><topic>Information extraction</topic><topic>Information processing</topic><topic>Information retrieval</topic><topic>Language</topic><topic>Masked language modelling</topic><topic>Medical language model</topic><topic>Natural Language Processing</topic><topic>Non-English languages</topic><topic>Patients</topic><topic>Performance evaluation</topic><topic>Physicians</topic><topic>Portugal</topic><topic>Portuguese language</topic><topic>Unstructured data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Nunes, Miguel</creatorcontrib><creatorcontrib>Boné, João</creatorcontrib><creatorcontrib>Ferreira, João C.</creatorcontrib><creatorcontrib>Chaves, Pedro</creatorcontrib><creatorcontrib>Elvas, Luis B.</creatorcontrib><collection>ScienceDirect Open Access Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Technology Research 
Database</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Biochemistry Abstracts 1</collection><collection>Nursing &amp; Allied Health Premium</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Computers in biology and medicine</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Nunes, Miguel</au><au>Boné, João</au><au>Ferreira, João C.</au><au>Chaves, Pedro</au><au>Elvas, Luis B.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MediAlbertina: An European Portuguese medical language model</atitle><jtitle>Computers in biology and medicine</jtitle><addtitle>Comput Biol Med</addtitle><date>2024-11</date><risdate>2024</risdate><volume>182</volume><spage>109233</spage><pages>109233-</pages><artnum>109233</artnum><issn>0010-4825</issn><issn>1879-0534</issn><eissn>1879-0534</eissn><abstract>Patient medical information often exists in unstructured text containing abbreviations and acronyms deemed essential to conserve time and space but posing challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by a language model and continue its pre-training to develop an European Portuguese (PT-PT) healthcare-domain language model. After carrying out a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training using more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M has been created through domain adaptation on this data using masked language modelling. 
The comparison with our baseline was made through the usage of both perplexity, which decreased from about 20 to 1.6 values, and the fine-tuning and evaluation of information extraction models such as Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4–6% on recall and f1-score. This study contributes with the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and offers a contribution to the scientific community in overcoming obstacles of non-English languages. With MediAlbertina, further steps can be taken to assist physicians, in creating decision support systems or building medical timelines in order to perform profiling, by fine-tuning MediAlbertina for PT- PT medical tasks. Problem: There is a vast amount of medical unstructured text, yet there is no publicly available medical LM trained with PT-PT data. What is Already Known: Domain adaptation is a process that allows achieving better results in NLP tasks compared to general models. Several studies have performed domain adaptation, and there are concentrated efforts in utilizing this technique for non-English languages where data availability and literature are scarce. What This Paper Adds: This study presents the first publicly available medical LM trained with PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of being a non-English language and providing motivation for other non-English languages to perform similar tasks. 
With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI methods to assist physicians in reducing their workload and, indirectly, improving patients' outcomes.</abstract><cop>United States</cop><pub>Elsevier Ltd</pub><pmid>39362002</pmid><doi>10.1016/j.compbiomed.2024.109233</doi><orcidid>https://orcid.org/0000-0003-4201-5641</orcidid><orcidid>https://orcid.org/0000-0003-4909-6341</orcidid><orcidid>https://orcid.org/0000-0002-7489-4380</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0010-4825
ispartof Computers in biology and medicine, 2024-11, Vol.182, p.109233, Article 109233
issn 0010-4825
1879-0534
language eng
recordid cdi_proquest_miscellaneous_3112856769
source ScienceDirect Journals
subjects Abbreviations
Adaptation
Alzheimer's disease
Artificial intelligence
Cancer
Decision support systems
Domain adaptation
Domain specific languages
Effectiveness
Electronic Health Records
Electronic medical records
English language
European Portuguese
Health care industry
Hospitals
Humans
Information extraction
Information processing
Information retrieval
Language
Masked language modelling
Medical language model
Natural Language Processing
Non-English languages
Patients
Performance evaluation
Physicians
Portugal
Portuguese language
Unstructured data
title MediAlbertina: An European Portuguese medical language model
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T20%3A37%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MediAlbertina:%20An%20European%20Portuguese%20medical%20language%20model&rft.jtitle=Computers%20in%20biology%20and%20medicine&rft.au=Nunes,%20Miguel&rft.date=2024-11&rft.volume=182&rft.spage=109233&rft.pages=109233-&rft.artnum=109233&rft.issn=0010-4825&rft.eissn=1879-0534&rft_id=info:doi/10.1016/j.compbiomed.2024.109233&rft_dat=%3Cproquest_cross%3E3112856769%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3125997197&rft_id=info:pmid/39362002&rfr_iscdi=true