Mismatch string kernels for discriminative protein classification

Bibliographic Details
Published in:	Bioinformatics 2004-03, Vol.20 (4), p.467-476
Main Authors:	Leslie, Christina S.; Eskin, Eleazar; Cohen, Adiel; Weston, Jason; Noble, William Stafford
Format:	Article
Language:	English
Subjects:	Algorithms; Amino Acid Sequence; Artificial Intelligence; Biological and medical sciences; Fundamental and applied biological sciences. Psychology; General aspects; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Molecular Sequence Data; Nuclear Proteins - chemistry; Nuclear Proteins - classification; Pattern Recognition, Automated; Phosphoprotein Phosphatases; Proteins - chemistry; Proteins - classification; Sequence Alignment - methods; Sequence Analysis, Protein - methods; Sequence Homology, Amino Acid

Description
Summary:	Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.
ISSN:	1367-4803, 1460-2059, 1367-4811
DOI:	10.1093/bioinformatics/btg431
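
Illustration: the summary describes kernels that compare sequences by shared occurrences of fixed-length patterns while allowing mutations between patterns. The following is a minimal Python sketch of that idea, not the authors' software. It assumes a pattern length k and a mismatch budget m, and it enumerates each pattern's mismatch neighborhood by brute force rather than using the mismatch tree mentioned in the summary. The function names, the AMINO_ACIDS constant, and the parameter names are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations, product

# Assumption: the 20 standard amino acids as the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return all strings within Hamming distance m of kmer."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        # Choose d positions to alter, then try every letter there.
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet=AMINO_ACIDS):
    """Count, for each pattern, the k-mers of seq within m mismatches of it."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for pattern in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[pattern] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """Inner product of the two sparse mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller vector
    return sum(count * fy[pattern] for pattern, count in fx.items())
```

With m = 0 the sketch reduces to counting exactly shared k-mers. The brute-force neighborhood grows quickly with k and m, so this version is only suitable for small examples; the resulting kernel values could in principle be passed to any SVM implementation that accepts a precomputed kernel matrix.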