Automated Big Data Analysis in Bottom-up and Targeted Proteomics

Bibliographic Details
Published in:Journal of biomolecular techniques 2014-05, Vol.25 (Suppl), p.S7-S7
Main Authors: van der Plas-Duivesteijn, Suzanne, Domański, Dominik, Smith, Derek, Borchers, Christoph, Palmblad, Magnus, Mohammed, Yassene
Format: Article
Language:English
Summary: Similar to other data-intensive sciences, analyzing mass spectrometry-based proteomics data involves multiple steps and diverse software using different algorithms, data formats, and data sizes. Besides the distributed and evolving nature of the data in online repositories, another challenge is that scientists have to deal with many steps of analysis pipelines. Documented data processing is also becoming an essential part of the overall reproducibility of results. Thanks to different e-Science initiatives, scientific workflow engines have become a means for automated, sharable, and reproducible data processing. While these are designed as general tools, they can be employed to solve different challenges that we face in handling our Big Data. Here we present three use cases: improving the performance of different spectral search engines by decomposing input data and recomposing the resulting files, building spectral libraries from more than 20 million spectra, and integrating information from multiple resources to select the most appropriate peptides for targeted proteomics analyses. The three use cases demonstrate different challenges in proteomics data analysis. In the first, we integrate local and cloud processing resources to obtain better performance, resulting in a more than 30-fold speed improvement. By treating search engines as legacy software, our solution is applicable to multiple search algorithms. The second use case is an example of automated processing of many data files of different sizes and locations, starting with raw data and ending with the final, ready-to-use library. It demonstrates robustness and fault tolerance when dealing with huge amounts of data stored in multiple files. The third use case demonstrates retrieval and integration of information and data from multiple online repositories. In addition to the diversity of data formats and Web interfaces, this use case also illustrates how to deal with incomplete data.
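The decompose/recompose strategy of the first use case can be sketched as a scatter/gather pattern: split the input peak list into chunks, process each chunk in parallel, then merge the per-chunk results. The sketch below is illustrative only and assumes Mascot Generic Format (MGF) input; `split_mgf`, the stub `search_chunk`, and the toy input are invented for this example — the abstract's actual workflows drive external search engines as legacy software rather than a Python function.

```python
# Minimal scatter/gather sketch (hypothetical helpers, not the authors' code):
# split an MGF peak list into chunks, process each chunk in parallel,
# and recompose the per-chunk results. A real pipeline would invoke an
# external search engine (e.g. via subprocess) instead of the stub below.
from concurrent.futures import ProcessPoolExecutor


def split_mgf(text, n_chunks):
    """Split MGF text into up to n_chunks strings of complete
    BEGIN IONS / END IONS blocks."""
    spectra, block = [], []
    for line in text.splitlines():
        block.append(line)
        if line.strip() == "END IONS":
            spectra.append("\n".join(block))
            block = []
    chunks = [[] for _ in range(n_chunks)]
    for i, spectrum in enumerate(spectra):
        chunks[i % n_chunks].append(spectrum)  # round-robin keeps chunks balanced
    return ["\n".join(chunk) for chunk in chunks if chunk]


def search_chunk(chunk):
    """Stand-in for a legacy search engine run: count spectra in the chunk."""
    return chunk.count("BEGIN IONS")


if __name__ == "__main__":
    # Toy input: ten one-peak spectra.
    mgf = "\n".join(
        f"BEGIN IONS\nTITLE=spectrum{i}\n100.0 1.0\nEND IONS" for i in range(10)
    )
    chunks = split_mgf(mgf, 4)
    with ProcessPoolExecutor() as pool:          # scatter: one task per chunk
        results = list(pool.map(search_chunk, chunks))
    print(sum(results))                          # gather: merge partial results
```

Because the parallelism lives entirely in the splitting and merging steps, the search engine itself needs no modification, which is what makes the approach applicable to multiple search algorithms.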
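The incomplete-data problem of the third use case can be illustrated with a small merge step: when several repositories annotate the same peptide but each record may lack fields, later sources only fill in what is still missing. The function and the example records below are hypothetical, not the authors' actual implementation or real repository output.

```python
# Hedged sketch of integrating per-peptide annotations from several
# sources with missing fields (all names and records are invented).
def merge_records(sources):
    """Merge dicts keyed by peptide sequence; earlier sources take
    precedence, later sources only fill fields that are still absent."""
    merged = {}
    for source in sources:
        for peptide, fields in source.items():
            entry = merged.setdefault(peptide, {})
            for key, value in fields.items():
                if value is not None and key not in entry:
                    entry[key] = value
    return merged


# A local measurement knows the retention time but not the protein;
# an online repository supplies the protein accession and charge.
local_db = {"LSEPTIDER": {"retention_time": 41.2, "protein": None}}
repository = {"LSEPTIDER": {"protein": "P12345", "charge": 2}}

merged = merge_records([local_db, repository])
print(merged["LSEPTIDER"])
# → {'retention_time': 41.2, 'protein': 'P12345', 'charge': 2}
```

In practice each source would also need its own format and Web-interface adapter before records reach a common shape like this, which is exactly the heterogeneity the abstract points to.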
ISSN:1524-0215
1943-4731