MediAlbertina: An European Portuguese medical language model


Bibliographic Details
Published in: Computers in Biology and Medicine, 2024-11, Vol. 182, p. 109233, Article 109233
Main Authors: Nunes, Miguel, Boné, João, Ferreira, João C., Chaves, Pedro, Elvas, Luis B.
Format: Article
Language:English
Description
Summary: Patient medical information often exists as unstructured text containing abbreviations and acronyms that conserve time and space but pose challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to use the knowledge acquired by an existing language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model. After a filtering process, Albertina PT-PT 900M was selected as our base language model, and we continued its pre-training on more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on these data using masked language modelling. Comparison with our baseline used both perplexity, which decreased from about 20 to 1.6, and the fine-tuning and evaluation of information extraction models for Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT on both tasks by 4–6% in recall and F1-score. This study contributes the first publicly available medical language model trained on PT-PT data. It underscores the efficacy of domain adaptation and offers the scientific community a contribution toward overcoming the obstacles faced by non-English languages. With MediAlbertina, further steps can be taken to assist physicians, such as creating decision support systems or building medical timelines for patient profiling, by fine-tuning MediAlbertina for PT-PT medical tasks.

Problem: There is a vast amount of unstructured medical text, yet there is no publicly available medical LM trained on PT-PT data.

What is Already Known: Domain adaptation is a process that achieves better results on NLP tasks than general-purpose models. Several studies have performed domain adaptation, and there are concentrated efforts to apply this technique to non-English languages where data availability and literature are scarce.

What This Paper Adds: This study presents the first publicly available medical LM trained on PT-PT EMRs from Portugal's largest public hospital, attempting to overcome the barrier of being a non-English language and providing motivation for similar work in other non-English languages. With our model, we present an additional tool that can help structure medical information, which will be beneficial for the application of AI m
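The abstract's headline comparison metric, perplexity, is the exponential of the mean negative log-likelihood assigned to the evaluated (masked) tokens: lower values mean the model is less "surprised" by in-domain text. A minimal sketch of the computation follows; the per-token probabilities are invented purely to illustrate how scores of about 20 and 1.6 arise, and are not taken from the paper:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood
    over the evaluated tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Illustrative only: a model assigning probability 0.05 to each
# masked token scores perplexity 20; probability 1/1.6 scores 1.6.
general = [math.log(0.05)] * 4       # weak fit to medical text
adapted = [math.log(1 / 1.6)] * 4    # strong fit after adaptation
print(round(perplexity(general), 2))  # 20.0
print(round(perplexity(adapted), 2))  # 1.6
```

In practice such scores are computed by masking tokens in held-out EMR text and reading the log-probabilities off the model's output distribution.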
ISSN: 0010-4825
eISSN: 1879-0534
DOI: 10.1016/j.compbiomed.2024.109233