Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra

Bibliographic Details
Published in: Journal of Proteome Research 2017-11, Vol. 16 (11), p. 4035-4044
Main Authors: Rieder, Vera, Schork, Karin U, Kerschke, Laura, Blank-Landeshammer, Bernhard, Sickmann, Albert, Rahnenführer, Jörg
Format: Article
Language: English
Description
Summary: In proteomics, liquid chromatography–tandem mass spectrometry (LC–MS/MS) is established for identifying peptides and proteins. Duplicated spectra, that is, multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, for example, on precursor mass.
ISSN: 1535-3893, 1535-3907
DOI: 10.1021/acs.jproteome.7b00427
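
The summary above names three ingredients that a small example can make concrete: the cosine distance between spectra, clustering by connected components of a graph, and purity as a validation measure. The following is a minimal sketch under stated assumptions, not the authors' implementation. The binned-spectrum representation (a dict from m/z bin to intensity), the distance thresholds, the toy spectra, the peptide labels, and all function names are hypothetical. Purity is computed with the common textbook definition (share of spectra carrying the majority annotation of their cluster), which may differ in detail from the definition used in the article.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two binned spectra (dict: m/z bin -> intensity)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty spectrum is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def connected_components(spectra, threshold):
    """Label spectra by the connected components of the graph that links
    two spectra whenever their cosine distance is <= threshold."""
    parent = list(range(len(spectra)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if cosine_distance(spectra[i], spectra[j]) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(spectra))]


def purity(labels, annotations):
    """Fraction of spectra whose annotation is the majority annotation of their cluster."""
    clusters = {}
    for label, annotation in zip(labels, annotations):
        clusters.setdefault(label, []).append(annotation)
    majority = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return majority / len(annotations)


# Hypothetical toy data: five binned spectra with made-up peptide annotations.
spectra = [
    {100: 1.0, 200: 0.5, 300: 0.2},
    {100: 0.9, 200: 0.6, 300: 0.1},
    {150: 1.0, 250: 0.8},
    {150: 0.8, 250: 1.0},
    {100: 1.0, 250: 1.0},
]
annotations = ["PEPTIDEA", "PEPTIDEA", "PEPTIDEB", "PEPTIDEB", "PEPTIDEB"]

for threshold in (0.1, 0.5):
    labels = connected_components(spectra, threshold)
    print(threshold, len(set(labels)), purity(labels, annotations))
```

On this toy data a strict threshold keeps the two peptides apart, while a loose one lets the fifth spectrum bridge both groups into a single impure cluster. That chaining behaviour is a general property of connected-components clustering, and it illustrates why the choice of parameter settings matters when results are compared against peptide annotations.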