
Finding optimum width of discretization for gene expressions using functional annotations

Bibliographic Details
Published in: Computers in Biology and Medicine, 2017-11, Vol. 90, p. 59-67
Main Authors: Misra, Sampa, Ray, Shubhra Sankar
Format: Article
Language: English
Description
Summary: Discretizing gene expression values is an important preprocessing step because it reduces noise and experimental error, which in turn improves results in tasks such as gene regulatory network analysis and disease prediction. A supervised discretization method for gene expressions that uses gene annotations is developed. The method, called “Gene Annotation Based Discretization” (GABD), determines the discretization width by maximizing the positive predictive value (PPV), computed using gene annotations, for the top 20,000 gene pairs. The discretized expressions capture gene similarity better than the original expressions do. GABD is compared, in terms of PPV, with existing discretization methods such as equal width discretization, equal frequency discretization and k-means discretization. The utility of GABD is also shown by clustering genes with the k-medoid algorithm and thereby predicting the function of 23 unclassified Saccharomyces cerevisiae genes at a p-value cutoff of 10^-10. The source code for GABD is available at http://www.sampa.droppages.com/GABD.html.

• A method (GABD) is developed where annotations of genes are used to find the optimum width of discretization.
• Pearson correlation is used to compute similarity between expressions obtained using GABD.
• The optimum width is determined by maximizing the PPV of gene pairs having higher expression similarity.
• Functions of 23 unclassified Saccharomyces cerevisiae genes are predicted.
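The width selection described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is at the URL above) and it rests on several assumptions that the record does not confirm: expressions are discretized by fixed-width binning with `floor`, PPV is taken as the fraction of top-ranked gene pairs that share at least one annotation term, and the function names `discretize`, `top_pairs_by_pearson`, `ppv` and `best_width` are hypothetical.

```python
import numpy as np


def discretize(expr, width):
    """Map each expression value to the index of its fixed-width bin (assumed scheme)."""
    return np.floor(np.asarray(expr, dtype=float) / width).astype(int)


def top_pairs_by_pearson(data, k):
    """Return the k gene pairs (i, j), i < j, with the highest Pearson correlation."""
    # Rows that become constant after discretization have zero variance,
    # so their correlations are NaN; rank those pairs last.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(data)
    corr = np.nan_to_num(corr, nan=-1.0)
    i_idx, j_idx = np.triu_indices(data.shape[0], k=1)
    order = np.argsort(-corr[i_idx, j_idx], kind="stable")[:k]
    return list(zip(i_idx[order].tolist(), j_idx[order].tolist()))


def ppv(pairs, annotations):
    """Fraction of pairs whose two genes share at least one annotation term (assumed definition)."""
    if not pairs:
        return 0.0
    hits = sum(1 for i, j in pairs if annotations[i] & annotations[j])
    return hits / len(pairs)


def best_width(expr, annotations, widths, k):
    """Score every candidate width by PPV of its top-k pairs and return the best one."""
    scores = {
        w: ppv(top_pairs_by_pearson(discretize(expr, w), k), annotations)
        for w in widths
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: rows are genes, columns are samples; annotations are sets of terms.
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(6, 12))
    annotations = [{"a"}, {"a"}, {"b"}, {"b"}, {"a", "b"}, {"c"}]
    width, scores = best_width(expr, annotations, widths=[0.25, 0.5, 1.0], k=5)
    print(width, scores)
```

Here `k` plays the role of the 20,000 top gene pairs mentioned in the summary, and the candidate list `widths` is a free choice of the caller; the record does not say how the candidate widths are generated.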
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2017.09.010