Loading…

LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification

Abstract Motivation Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Results Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is base...

Full description

Saved in:
Bibliographic Details
Published in:Bioinformatics 2018-10, Vol.34 (19), p.3281-3288
Main Authors: Filatov, Gleb, Bauwens, Bruno, Kertész-Farkas, Attila
Format: Article
Language:English
Citations: Items that this one cites
Items that cite this one
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Abstract Motivation Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Results Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method, it is always symmetric, is positive, always provides 1.0 for self-similarity and it can directly be used with Support Vector Machines (SVMs) in classification problems, contrary to normalized compression distance, which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly plausible for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed, three times faster than BLAST and several magnitudes faster than SW or LAK in our tests. Availability and implementation LZW-Kernel is implemented as a standalone C code and is a free open-source program distributed under GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel. Supplementary information Supplementary data are available at Bioinformatics Online.
ISSN:1367-4803
1460-2059
1367-4811
DOI:10.1093/bioinformatics/bty349