Combining evidence, specificity, and proximity towards the normalization of Gene Ontology terms in text

Bibliographic Details
Published in: EURASIP journal on bioinformatics & systems biology 2008-03, Vol.2008 (1), p.342746-342746
Main Authors: Gaudan, S, Jimeno Yepes, A, Lee, V, Rebholz-Schuhmann, D
Format: Article
Language: English
Description
Summary: Structured information provided by manual annotation of proteins with Gene Ontology concepts represents a high-quality reliable data source for the research community. However, a limited scope of proteins is annotated due to the amount of human resources required to fully annotate each individual gene product from the literature. We introduce a novel method for automatic identification of GO terms in natural language text. The method takes into consideration several features: (1) the evidence for a GO term given by the words occurring in text, (2) the proximity between the words, and (3) the specificity of the GO terms based on their information content. The method has been evaluated on the BioCreAtIvE corpus and has been compared to current state-of-the-art methods. The precision reached 0.34 at a recall of 0.34 for the identified terms at rank 1. In our analysis, we observe that the identification of GO terms in the "cellular component" subbranch of GO is more accurate than for terms from the other two subbranches. This observation is explained by the average number of words forming the terminology over the different subbranches.
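The summary names information content as the basis for term specificity but does not give the formula. As an illustration only, the sketch below uses the common definition IC(t) = -log2 p(t), where p(t) is the relative frequency of a term; this is an assumption and not necessarily the authors' formulation. The function name `information_content` and the term counts are hypothetical and were invented for this example.

```python
import math


def information_content(term_counts):
    """Return -log2(p(term)) for each term, with p its relative frequency.

    Assumption: this is the textbook definition of information content,
    not a formula taken from the paper.
    """
    total = sum(term_counts.values())
    return {
        term: -math.log2(count / total)
        for term, count in term_counts.items()
    }


# Hypothetical counts, invented for illustration (not data from the paper).
counts = {
    "binding": 8,
    "protein binding": 4,
    "kinase binding": 2,
    "cyclin binding": 2,
}

ic = information_content(counts)

# Rarer terms get a higher score, so they count as more specific.
for term, score in sorted(ic.items(), key=lambda item: item[1]):
    print(f"{term}: {score:.2f}")
```

Under this definition a rarely used term scores higher than a frequent one, which matches the intuition that specific GO terms carry more information than general ones.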
ISSN: 1687-4145
1687-4153
DOI: 10.1186/1687-4153-2008-342746