MediAlbertina: An European Portuguese medical language model
Patient medical information often exists in unstructured text containing abbreviations and acronyms deemed essential to conserve time and space but posing challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the kno...
Published in: | Computers in biology and medicine 2024-11, Vol.182, p.109233, Article 109233 |
---|---|
Main Authors: | Nunes, Miguel ; Boné, João ; Ferreira, João C. ; Chaves, Pedro ; Elvas, Luis B. |
Format: | Article |
Language: | English |
Subjects: | |
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573 |
container_end_page | |
container_issue | |
container_start_page | 109233 |
container_title | Computers in biology and medicine |
container_volume | 182 |
creator | Nunes, Miguel ; Boné, João ; Ferreira, João C. ; Chaves, Pedro ; Elvas, Luis B. |
description | Patient medical information often exists in unstructured text containing abbreviations and acronyms, which are deemed essential to conserve time and space but pose challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by a language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model.
After carrying out a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training using more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on this data using masked language modelling.
We compared the model with our baseline using both perplexity, which decreased from about 20 to 1.6, and the fine-tuning and evaluation of information extraction models for Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4–6% on recall and F1-score.
This study contributes the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and helps the scientific community overcome obstacles facing non-English languages. By fine-tuning MediAlbertina for PT-PT medical tasks, further steps can be taken to assist physicians, such as creating decision support systems or building medical timelines to perform profiling.
Problem: There is a vast amount of unstructured medical text, yet there is no publicly available medical LM trained with PT-PT data.
What is Already Known: Domain adaptation yields better results in NLP tasks than general-purpose models. Several studies have performed domain adaptation, and there are concentrated efforts to apply this technique to non-English languages, where data availability and literature are scarce.
What This Paper Adds: This study presents the first publicly available medical LM trained with PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of working in a non-English language and motivating similar work for other non-English languages. With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI methods to assist physicians in reducing their workload and, indirectly, improving patients' outcomes. |
doi_str_mv | 10.1016/j.compbiomed.2024.109233 |
format | article |
addtitle | Comput Biol Med |
eissn | 1879-0534 |
pmid | 39362002 |
pqid | 3125997197 |
els_id | S0010482524013180 |
publisher | United States: Elsevier Ltd |
rights | Copyright © 2024 The Authors. Published by Elsevier Ltd. All rights reserved. |
orcidid | 0000-0003-4201-5641 ; 0000-0003-4909-6341 ; 0000-0002-7489-4380 |
oa | free_for_read |
toplevel | peer_reviewed ; online_resources |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/39362002 (View this record in MEDLINE/PubMed) |
collection | ScienceDirect Open Access Titles ; Elsevier:ScienceDirect:Open Access ; Medline ; MEDLINE ; MEDLINE (Ovid) ; PubMed ; CrossRef ; Technology Research Database ; Engineering Research Database ; ProQuest Computer Science Collection ; ProQuest Health & Medical Complete (Alumni) ; Biochemistry Abstracts 1 ; Nursing & Allied Health Premium ; Biotechnology and BioEngineering Abstracts ; MEDLINE - Academic |
fulltext | fulltext |
identifier | ISSN: 0010-4825 |
ispartof | Computers in biology and medicine, 2024-11, Vol.182, p.109233, Article 109233 |
issn | 0010-4825 ; 1879-0534 |
language | eng |
recordid | cdi_proquest_miscellaneous_3112856769 |
source | ScienceDirect Journals |
subjects | Abbreviations ; Adaptation ; Alzheimer's disease ; Artificial intelligence ; Cancer ; Decision support systems ; Domain adaptation ; Domain specific languages ; Effectiveness ; Electronic Health Records ; Electronic medical records ; English language ; European Portuguese ; Health care industry ; Hospitals ; Humans ; Information extraction ; Information processing ; Information retrieval ; Language ; Masked language modelling ; Medical language model ; Natural Language Processing ; Non-English languages ; Patients ; Performance evaluation ; Physicians ; Portugal ; Portuguese language ; Unstructured data |
title | MediAlbertina: An European Portuguese medical language model |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T20%3A37%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MediAlbertina:%20An%20European%20Portuguese%20medical%20language%20model&rft.jtitle=Computers%20in%20biology%20and%20medicine&rft.au=Nunes,%20Miguel&rft.date=2024-11&rft.volume=182&rft.spage=109233&rft.pages=109233-&rft.artnum=109233&rft.issn=0010-4825&rft.eissn=1879-0534&rft_id=info:doi/10.1016/j.compbiomed.2024.109233&rft_dat=%3Cproquest_cross%3E3112856769%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c2423-3f3429823f46d642f1e8691202e6f98f54f3225ae381d8e9ff39cbe698a4f7573%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3125997197&rft_id=info:pmid/39362002&rfr_iscdi=true |
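
The abstract in this record reports two kinds of evaluation figures: perplexity (from about 20 down to 1.6) and recall/F1 gains on information extraction tasks. As a reading aid, the sketch below shows how such figures are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the per-token loss values are constructed only so that the output matches the two perplexity figures quoted in the abstract. It assumes perplexity is taken as the exponential of the mean negative log-likelihood (natural log) over the predicted tokens, which is the usual convention.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood), natural log assumed."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-token losses, chosen only to reproduce the figures
# quoted in the abstract (about 20 before adaptation, 1.6 after).
baseline_losses = [math.log(20.0)] * 4
adapted_losses = [math.log(1.6)] * 4

if __name__ == "__main__":
    print(round(perplexity(baseline_losses), 2))
    print(round(perplexity(adapted_losses), 2))
    print(f1_score(0.5, 0.5))
```

Read this way, a perplexity of 1.6 means the adapted model is, on average, about as uncertain as a choice among fewer than two equally likely tokens on the in-domain text, against roughly twenty for the base model.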