
Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit


Bibliographic Details
Published in: Artificial intelligence in medicine 2021-07, Vol.117, p.102083, Article 102083
Main Authors: Kraljevic, Zeljko; Searle, Thomas; Shek, Anthony; Roguski, Lukasz; Noor, Kawsar; Bean, Daniel; Mascio, Aurelie; Zhu, Leilei; Folarin, Amos A.; Roberts, Angus; Bendayan, Rebecca; Richardson, Mark P.; Stewart, Robert; Shah, Anoop D.; Wong, Wai Keong; Ibrahim, Zina; Teo, James T.; Dobson, Richard J.B.
Format: Article
Language: English
Description
Summary:
• High-performance clinical information extraction supports pertinent clinical research.
• Multi-site hospital natural language processing models scale across settings.
• Flexible informatics empowers fast clinician-led research and analysis.
• Fast, scalable, flexible electronic health record information extraction.

Electronic health records (EHRs) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT), which provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary, including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at three large London hospitals, with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types, indicating cross-domain, EHR-agnostic utility for accelerated clinical and research use cases.
ISSN:0933-3657
1873-2860
DOI:10.1016/j.artmed.2021.102083