PathologyBERT - Pre-trained Vs. A New Transformer Language Model for Pathology Domain

Bibliographic Details
Published in: AMIA Annual Symposium Proceedings, 2023-04, Vol. 2022, p. 962-971
Main Authors: Santos, Thiago, Tariq, Amara, Das, Susmita, Vayalpati, Kavyasree, Smith, Geoffrey H., Trivedi, Hari, Banerjee, Imon
Format: Article
Language:English
Description
Summary: Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research, such as similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language model exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT - a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Hugging Face repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to non-specific language models.
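
Since the abstract notes the model is publicly released on Hugging Face, a minimal sketch of loading it for masked-token prediction with the transformers library is shown below. The model identifier "tsantos/PathologyBERT" and the example sentence are assumptions for illustration, not details stated in this record.

    # Minimal sketch: masked-token prediction with a pathology-domain BERT
    # via the Hugging Face transformers library.
    # Assumption: the model is published under the id "tsantos/PathologyBERT".
    from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

    model_id = "tsantos/PathologyBERT"  # assumed Hugging Face model identifier

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)

    # Fill-mask pipeline: predicts the most likely tokens for [MASK].
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

    # Hypothetical pathology-style sentence used only to illustrate usage.
    for pred in fill_mask("invasive ductal [MASK] identified in the left breast biopsy"):
        print(f"{pred['token_str']}\t{pred['score']:.3f}")

The same pretrained encoder could, in principle, be fine-tuned for a downstream label task (e.g., diagnosis classification) by swapping AutoModelForMaskedLM for AutoModelForSequenceClassification; the paper's own fine-tuning setup is not described in this record.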
ISSN: 1559-4076