Protein classification based on text document classification techniques

Bibliographic Details
Published in: Proteins: Structure, Function, and Bioinformatics, 2005-03, Vol.58 (4), p.955-970
Main Authors: Cheng, Betty Yee Man, Carbonell, Jaime G., Klein-Seetharaman, Judith
Format: Article
Language: English
Description
Summary: The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G‐protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among its members. Previous comparisons of BLAST, k‐nearest neighbor (k‐NN), hidden Markov model (HMM) and support vector machine (SVM) classifiers using alignment‐based features have suggested that classifiers as complex as SVM are needed to attain high accuracy. Here, by analogy with document classification, we applied Decision Tree and Naïve Bayes classifiers with chi‐square feature selection on counts of n‐grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naïve Bayes classifier attained accuracies of 93.0% and 92.4% in level I and level II subfamily classification, respectively, compared with reported SVM accuracies of 88.4% and 86.3%. This is a 39.7% and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, we show that our method generalizes to other protein families by applying it to the superfamily of nuclear receptors, achieving 94.5%, 97.8% and 93.6% accuracy in family, level I and level II subfamily classification, respectively. Proteins 2005. © 2005 Wiley‐Liss, Inc.
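The approach in the summary (n‐gram counts, chi‐square feature selection, Naïve Bayes) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the presence/absence form of the chi‐square statistic, the multinomial model with add‐one smoothing, and the toy sequences are all assumptions made for illustration. The last helper reproduces the residual‐error arithmetic from the accuracies quoted in the summary.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(seq, n=2):
    """Count overlapping n-grams (peptides of length n) in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def chi_square_scores(docs, labels):
    """Score each n-gram by its largest one-vs-rest chi-square statistic.

    Assumption: the statistic uses presence/absence of the n-gram per sequence.
    """
    total = len(docs)
    class_size = Counter(labels)
    present = defaultdict(Counter)  # n-gram -> class -> sequences containing it
    for doc, label in zip(docs, labels):
        for gram in doc:
            present[gram][label] += 1
    scores = {}
    for gram, per_class in present.items():
        with_gram = sum(per_class.values())
        best = 0.0
        for cls, size in class_size.items():
            a = per_class[cls]        # in class, contains the n-gram
            b = with_gram - a         # other classes, contains it
            c = size - a              # in class, lacks it
            d = total - size - b      # other classes, lacks it
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, total * (a * d - b * c) ** 2 / denom)
        scores[gram] = best
    return scores


def train_naive_bayes(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected n-grams."""
    class_size = Counter(labels)
    gram_totals = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for gram, k in doc.items():
            if gram in vocab:
                gram_totals[label][gram] += k
    model = {}
    for cls, size in class_size.items():
        denom = sum(gram_totals[cls].values()) + len(vocab)
        log_like = {g: math.log((gram_totals[cls][g] + 1) / denom) for g in vocab}
        model[cls] = (math.log(size / len(labels)), log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior for one n-gram count vector."""
    def score(cls):
        prior, log_like = model[cls]
        return prior + sum(k * log_like[g] for g, k in doc.items() if g in log_like)
    return max(model, key=score)


def residual_error_reduction(baseline_acc, new_acc):
    """Percent of the baseline's error (100 - accuracy) removed by the new method."""
    return 100 * (new_acc - baseline_acc) / (100 - baseline_acc)


if __name__ == "__main__":
    # Toy sequences and labels invented for illustration; not GPCR data.
    train = [("AAKAAK", "X"), ("AKAAKA", "X"), ("GGSGGS", "Y"), ("GSGGSG", "Y")]
    docs = [ngram_counts(s) for s, _ in train]
    labels = [label for _, label in train]
    scores = chi_square_scores(docs, labels)
    vocab = {g for g, s in scores.items() if s > 0}  # keep informative n-grams
    model = train_naive_bayes(docs, labels, vocab)
    print(predict(model, ngram_counts("KAAKAA")))
    print(round(residual_error_reduction(88.4, 93.0), 1))
    print(round(residual_error_reduction(86.3, 92.4), 1))
```

In practice the chi‐square scores would be used to keep only the top‐ranked n‐grams; the number of features to retain is a tuning choice that this record does not specify.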
ISSN: 0887-3585, 1097-0134
DOI:10.1002/prot.20373