Biomedical named entity recognition through improved balanced undersampling for addressing class imbalance and preserving contextual information

Bibliographic Details
Published in: International journal of information technology (Singapore. Online) 2024-12, Vol.16 (8), p.4995-5003
Main Authors: Archana, S. M., Prakash, Jay
Format: Article
Language: English
Description
Summary: Biomedical Named Entity Recognition (Bio-NER) identifies and categorises named entities in biomedical text, such as diseases, chemicals, proteins, and genes. Because most biomedical data originates from the real world, the majority of data instances do not pertain to the named entity of interest. This class imbalance adversely affects the performance of machine learning models for Bio-NER, as their learning objective is usually dominated by the majority non-entity tokens. Various undersampling techniques have been introduced to address this issue. Balanced Undersampling (BUS) is one such approach, operating at the sentence level to improve Bio-NER. However, BUS falls short in preserving contextual information during the undersampling procedure. To overcome this limitation, we introduce an improved Balanced Undersampling method (iBUS) for Bio-NER. During undersampling, iBUS considers the importance of individual instances and generates a balanced dataset while retaining essential instances. To validate the effectiveness of the proposed method against competitive methods, we perform experiments on the NCBI disease, CHEMDNER, and BC5CDR chemical datasets. The experimental results show that the proposed method outperforms competitive approaches in terms of F1 score.
ISSN: 2511-2104, 2511-2112
DOI: 10.1007/s41870-024-02137-w
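
The sentence-level undersampling idea mentioned in the summary can be illustrated with a minimal sketch. This is a generic random baseline, not the BUS or iBUS procedure from the article, whose details are not given in this record. The function name `undersample_sentences`, the `neg_per_pos` parameter, the BIO-style tags, and the toy sentences are all assumptions made for illustration.

```python
import random


def undersample_sentences(sentences, neg_per_pos=1.0, seed=0):
    """Keep every sentence containing an entity tag and randomly sample
    the entity-free ones (hypothetical baseline, not the paper's method).

    sentences: list of sentences, each a list of (token, tag) pairs,
               where the tag "O" marks a non-entity token.
    neg_per_pos: entity-free sentences to keep per entity sentence.
    """
    rng = random.Random(seed)
    with_entity, without_entity = [], []
    for idx, sent in enumerate(sentences):
        has_entity = any(tag != "O" for _, tag in sent)
        (with_entity if has_entity else without_entity).append(idx)

    # Never ask for more entity-free sentences than are available.
    n_keep = min(len(without_entity), int(neg_per_pos * len(with_entity)))
    kept = set(with_entity) | set(rng.sample(without_entity, n_keep))

    # Return whole sentences in their original order.
    return [sentences[i] for i in sorted(kept)]


# Toy data (invented for illustration): 2 entity sentences, 5 entity-free.
toy = [
    [("Aspirin", "B-Chemical"), ("helps", "O")],
    [("The", "O"), ("study", "O")],
    [("Results", "O"), ("vary", "O")],
    [("cystic", "B-Disease"), ("fibrosis", "I-Disease")],
    [("See", "O"), ("below", "O")],
    [("No", "O"), ("change", "O")],
    [("Data", "O"), ("follow", "O")],
]
balanced = undersample_sentences(toy, neg_per_pos=1.0, seed=0)
```

Dropping whole sentences keeps each retained sentence intact, but it selects entity-free sentences blindly. According to the summary, iBUS instead weighs the importance of individual instances when deciding what to retain; how it does so is not described in this record.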