
Uncovering condition information loss in medical text extraction: The challenge of non-contiguous spans

Bibliographic Details
Published in:	Next Research 2024-12, Vol.1 (2), p.100044, Article 100044
Main Authors:	Shinohara, Emiko; Shimamoto, Kiminori; Kawazoe, Yoshimasa
Format:	Article
Language:	English
Subjects:	Data annotation; Entity linking; Natural language processing (NLP)

Description
Summary:	• In medical texts, 18.6% of patient condition phrases include unrelated information.
• The accuracy of traditional NER and EL methods is limited.
• We identified the types of entities that current extraction techniques often miss.

We investigated the limitations of conventional named entity recognition (NER) and entity linking (EL) methods in accurately extracting patient condition information from medical texts, focusing on the challenges posed by non-contiguous spans and the resulting potential for information loss. We used a corpus with entity-relation annotations and analyzed the frequency and nature of non-contiguous spans whose gaps contain irrelevant entities. We further analyzed the corpus to identify which types of entity representation are predominantly linked with peripheral spans (those that do not encompass the central symptom-describing terms), with a focus on items, body parts, and clinical tests. Our analysis revealed that 18.6% of patient condition expressions were non-contiguous spans containing irrelevant entities, which implies an accuracy ceiling of 81.4% for conventional NER and EL approaches in the worst-case scenario. The study highlights the importance of entity types such as items, body parts, and clinical tests in these expressions, indicating that conventional extraction methods incur considerable information loss. The findings underscore the need for more sophisticated information extraction techniques capable of handling the complexities of medical texts, including non-contiguous spans. Adapting methods to allow gaps within entities, or employing graph-based term assignments, can enhance the accuracy and comprehensiveness of medical text annotation.
ISSN:	3050-4759
DOI:	10.1016/j.nexres.2024.100044
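
Illustration:	The summary's central idea, a condition expression split by a gap that holds an unrelated entity, can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' annotation scheme: the `Mention` class, the `span_of` and `accuracy_ceiling` helpers, and the example sentence are all hypothetical. The only figure taken from the record is the 18.6% share of non-contiguous expressions, from which the 81.4% worst-case ceiling follows by subtraction.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical container: one condition expression as character spans."""
    text: str
    spans: tuple[tuple[int, int], ...]  # sorted (start, end) offsets into text

    def is_contiguous(self) -> bool:
        # A single span is what a conventional NER tagger can emit directly.
        return len(self.spans) == 1

    def surface(self) -> str:
        # Join the pieces that belong to the expression, skipping the gaps.
        return " ".join(self.text[s:e] for s, e in self.spans)

    def gaps(self) -> list[str]:
        # Text lying between consecutive spans (the "irrelevant" material).
        return [
            self.text[a_end:b_start].strip(" ,")
            for (_, a_end), (b_start, _) in zip(self.spans, self.spans[1:])
        ]


def span_of(text: str, piece: str) -> tuple[int, int]:
    """Locate the first occurrence of piece and return its (start, end) offsets."""
    start = text.index(piece)
    return (start, start + len(piece))


def accuracy_ceiling(noncontiguous_pct: float) -> float:
    """Worst-case ceiling if every non-contiguous expression is extracted wrongly."""
    return round(100.0 - noncontiguous_pct, 1)


# Invented example sentence: the clinical test "CT" sits inside the gap.
SENTENCE = "redness, seen on CT, of the arm"
MENTION = Mention(
    SENTENCE,
    (span_of(SENTENCE, "redness"), span_of(SENTENCE, "of the arm")),
)

if __name__ == "__main__":
    print(MENTION.is_contiguous())   # False
    print(MENTION.surface())         # redness of the arm
    print(MENTION.gaps())            # ['seen on CT']
    print(accuracy_ceiling(18.6))    # 81.4
```

A tagger limited to single contiguous spans would have to either include "seen on CT" in the condition phrase or drop "of the arm", which is the kind of information loss the summary describes.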