
Gene–disease association with literature based enrichment

Bibliographic Details
Published in: Journal of biomedical informatics 2014-06, Vol.49, p.221-226
Main Authors: Tsafnat, Guy, Jasch, Dennis, Misra, Agam, Choong, Miew Keen, Lin, Frank P.-Y., Coiera, Enrico
Format: Article
Language: English
Description
Summary:
• Knowledge-based functional enrichment for gene prioritization of high-throughput data.
• Automatic ontology generation from MEDLINE.
• Novel and fully automatic literature-based discovery.
• Literature ontologies perform better than expert-derived ones.

Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene–disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically from information extracted from the scientific literature will significantly enhance GSEA prediction quality.

Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were generated automatically in two steps:
1. Entrez Gene was used to associate all recorded human genes with PubMed article IDs.
2. The genes mentioned in each PubMed article were associated with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS).
Microarray data from the Gene Expression Omnibus for BrCa and CRC were then annotated using CMeSH and CUMLS and, for comparison, with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database, MSigDB). Ranking was done using a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring the area under the receiver operating characteristic curve (AUC).

Comparison of rankings using the literature enrichment catalogs, the pre-existing catalogs and five randomly generated catalogs shows that the literature-derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71), but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing.

Gene set enrichment significantly improved prediction of gene–disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature-based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be k
ISSN: 1532-0464
1532-0480
DOI: 10.1016/j.jbi.2014.03.007
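
Illustrative sketch (not part of the published record): the summary describes building a gene function catalog by linking genes to PubMed articles and then to each article's MeSH terms, and evaluating gene rankings by AUC against a gold standard gene set. The Python below is a minimal sketch of those two ideas under stated assumptions. It is not the authors' pipeline and it does not reproduce GSEA-p. The function names `build_catalog` and `auc`, the article IDs, the gene-to-article links and the scores are all hypothetical toy values. The AUC is computed with the generic pairwise (Mann-Whitney) formulation.

```python
from collections import defaultdict


def build_catalog(gene_to_pmids, pmid_to_terms):
    """Invert gene -> article and article -> term links into term -> gene set.

    Hypothetical helper: each term (e.g. a MeSH heading) becomes a gene set
    containing every gene linked to an article annotated with that term.
    """
    catalog = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            for term in pmid_to_terms.get(pmid, ()):
                catalog[term].add(gene)
    return dict(catalog)


def auc(scores, gold):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    scores: dict mapping gene -> score, where higher means more likely
            to be disease-associated.
    gold:   set of gold standard (positive) genes.
    """
    pos = [s for g, s in scores.items() if g in gold]
    neg = [s for g, s in scores.items() if g not in gold]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy inputs (assumed for illustration only; not real Entrez Gene links).
gene_to_pmids = {
    "BRCA1": ["pmid1", "pmid2"],
    "TP53": ["pmid2", "pmid3"],
    "APC": ["pmid3"],
}
pmid_to_terms = {
    "pmid1": ["Breast Neoplasms"],
    "pmid2": ["Breast Neoplasms", "DNA Repair"],
    "pmid3": ["Colorectal Neoplasms"],
}

catalog = build_catalog(gene_to_pmids, pmid_to_terms)
print(sorted(catalog["Colorectal Neoplasms"]))  # ['APC', 'TP53']

# Toy ranking scores and gold standard (assumed values).
toy_scores = {"BRCA1": 0.9, "TP53": 0.7, "GAPDH": 0.4, "APC": 0.2}
toy_gold = {"BRCA1", "TP53"}
print(auc(toy_scores, toy_gold))  # 1.0
```

An AUC of 0.5 corresponds to random guessing, which is the level the summary reports for the randomly generated catalogs.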