A Practical Example of Bringing Computation to Data

Bibliographic Details
Published in: Journal of Biomolecular Techniques, 2014-05, Vol. 25 (Suppl), p. S5
Main Authors: Downs, B.N., Opheim, D.M., Hale, W., Xi, L., Donehower, L.A., Kalra, D.
Format: Article
Language: English
Description
Summary: The rapid decline in sequencing costs has resulted in an ever-increasing number of data sets being generated by next-generation sequencing technologies. Downloading these data from a repository can consume much of an institute's network resources and severely impact overall analysis time. Storing and analyzing this “big data” is becoming a challenge not just for independent researchers but for large-scale sequencing centers as well. At Baylor College of Medicine, we needed to analyze a large amount of data from The Cancer Genome Atlas (TCGA) dataset hosted at the Cancer Genomics Hub (CGHub) of the University of California, Santa Cruz. We performed the initial reduction of the data at a compute farm co-located with CGHub. This significantly reduced the time required to download the data while preserving the final results. We processed 6302 BAM files corresponding to over 4508 samples. The initial data reduction was achieved by using samtools at the co-located compute farm to extract the exon data in question. This reduction was accelerated by combining gtfuse with samtools. Using this approach, we were able to shrink nearly 98 TB of data to 6.5 GB. Downloading the original data set of ~98 TB would have taken us 8.5 weeks. Our approach reduced the transfer time to 5.5 minutes (assuming a transfer rate of 20 MB/s). By greatly reducing both the download time and the storage size of our data set, we have demonstrated one way in which the big data paradigm of moving computation to the data can be a practical reality.
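The transfer-time figures in the summary can be checked with a short calculation. The sketch below is not from the article: it assumes binary units (1 TB = 1024 GB, 1 GB = 1024 MB), which the abstract does not state but which reproduces the quoted 8.5 weeks and 5.5 minutes at 20 MB/s. The function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the transfer times quoted in the summary.
# Assumption (not stated in the abstract): binary units, i.e.
# 1 TB = 1024 GB and 1 GB = 1024 MB.

MB_PER_GB = 1024
GB_PER_TB = 1024
SECONDS_PER_WEEK = 7 * 24 * 3600


def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time in seconds to move size_mb megabytes at a constant rate."""
    return size_mb / rate_mb_per_s


RATE_MB_PER_S = 20.0                       # transfer rate given in the summary
full_mb = 98 * GB_PER_TB * MB_PER_GB       # ~98 TB of original BAM data
reduced_mb = 6.5 * MB_PER_GB               # 6.5 GB after exon extraction

full_weeks = transfer_seconds(full_mb, RATE_MB_PER_S) / SECONDS_PER_WEEK
reduced_minutes = transfer_seconds(reduced_mb, RATE_MB_PER_S) / 60
reduction_factor = full_mb / reduced_mb

print(f"Full data set:    {full_weeks:.1f} weeks")
print(f"Reduced data set: {reduced_minutes:.1f} minutes")
print(f"Size reduction:   about {reduction_factor:,.0f}-fold")
```

Under these assumptions the reduction is roughly 15,000-fold in size, and because transfer time scales linearly with size at a fixed rate, the download time shrinks by the same factor.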
ISSN: 1524-0215, 1943-4731