Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation

Bibliographic Details
Published in: Bioinformatics 2003, Vol. 19, Suppl. 1, p. i91-i94
Main Authors: Dobrokhotov, Pavel B.; Goutte, Cyril; Veuthey, Anne-Lise; Gaussier, Eric
Format: Article
Language: English
Description
Summary: Searching relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents. With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the Kullback-Leibler symmetric divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It also has great value for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.
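Illustration: the summary mentions ranking words by their contribution to the symmetric Kullback-Leibler divergence. The sketch below shows one common way to do this, assuming two class-conditional term distributions (relevant vs. irrelevant documents) are already available. It is not the authors' implementation: the function names, the smoothing constant `eps`, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
import math

def symmetric_kl_contributions(p, q, eps=1e-12):
    """Per-term contributions to the symmetric KL divergence between p and q.

    p and q map terms to probabilities. The symmetric divergence is
    KL(p||q) + KL(q||p) = sum over terms of (p_t - q_t) * log(p_t / q_t),
    so each summand is the contribution of one term.
    eps is an assumed smoothing constant that avoids log(0).
    """
    contributions = {}
    for term in set(p) | set(q):
        pt = p.get(term, 0.0) + eps
        qt = q.get(term, 0.0) + eps
        contributions[term] = (pt - qt) * math.log(pt / qt)
    return contributions

def top_terms(p, q, k=3):
    """Return the k terms with the largest contribution, most discriminating first."""
    contributions = symmetric_kl_contributions(p, q)
    return sorted(contributions, key=contributions.get, reverse=True)[:k]

# Toy term distributions (made-up numbers, not data from the paper).
relevant = {"mutation": 0.40, "disease": 0.30, "protein": 0.20, "cell": 0.10}
irrelevant = {"mutation": 0.05, "disease": 0.10, "protein": 0.35, "cell": 0.50}

if __name__ == "__main__":
    for term in top_terms(relevant, irrelevant, k=4):
        print(term)
```

Each contribution is non-negative, so a term scores high whether it is typical of relevant documents or of irrelevant ones; the sign of `p_t - q_t` tells which class it points to.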
ISSN: 1367-4803, 1367-4811, 1460-2059
DOI: 10.1093/bioinformatics/btg1011