Loading…
LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification
Abstract Motivation Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Results Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is base...
Saved in:
Published in: | Bioinformatics 2018-10, Vol.34 (19), p.3281-3288 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Request full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Abstract
Motivation
Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis.
Results
Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method, it is always symmetric, is positive, always provides 1.0 for self-similarity and it can directly be used with Support Vector Machines (SVMs) in classification problems, contrary to normalized compression distance, which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly plausible for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed, three times faster than BLAST and several magnitudes faster than SW or LAK in our tests.
Availability and implementation
LZW-Kernel is implemented as a standalone C code and is a free open-source program distributed under GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel.
Supplementary information
Supplementary data are available at Bioinformatics Online. |
---|---|
ISSN: | 1367-4803 1460-2059 1367-4811 |
DOI: | 10.1093/bioinformatics/bty349 |