MPEG-G Reference-Based Compression of Unaligned Reads Through Ultra-Fast Alignments
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Summary: | With the widespread application of next-generation sequencing technologies, the volume of sequencing data has become comparable to that of big data domains. Compressing sequencing reads (nucleotide sequences, quality values, read names), in both raw and aligned data, is a way to alleviate the bandwidth, transfer, and storage requirements of genomics pipelines. ISO/IEC MPEG-G standardizes the compressed representation (i.e., storage and streaming) of structured, indexed sets of genomic sequencing data for both raw and aligned data. For aligned data, reference-based compression encodes the nucleotide sequences of sequencing reads using their alignment to a reference sequence: each read is represented by the starting position of its alignment on the reference and the differences between the reference and the actual read. This general scheme is implemented in different ways by genomic data compressors such as DeeZ, Quip, and CRAM, which operate on aligned reads. |
---|---|
ISSN: | 2375-0359 |
DOI: | 10.1109/DCC52660.2022.00089 |
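
The reference-based scheme described in the summary (store the alignment start position plus the differences from the reference) can be illustrated with a minimal Python sketch. It is not the MPEG-G, DeeZ, Quip, or CRAM encoding: it assumes the alignment position is already known, handles substitutions only (no insertions, deletions, or clipping), and the names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

# (offset within the read, base observed in the read); illustrative type only
Mismatch = Tuple[int, str]


def encode_read(reference: str, read: str, pos: int) -> Tuple[int, int, List[Mismatch]]:
    """Represent `read` as (start position, length, substitutions) against `reference`.

    Assumption: the read aligns at `pos` with substitutions only.
    """
    if pos < 0 or pos + len(read) > len(reference):
        raise ValueError("read does not fit on the reference at this position")
    window = reference[pos:pos + len(read)]
    # Keep only the positions where the read differs from the reference.
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), mismatches


def decode_read(reference: str, pos: int, length: int, mismatches: List[Mismatch]) -> str:
    """Rebuild the read by copying the reference window and applying substitutions."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
read = "ACGATAG"  # aligns at position 4 with one substitution
record = encode_read(reference, read, 4)
restored = decode_read(reference, *record)
```

Here `record` holds only a position, a length, and one substitution instead of the seven bases of the read. A real compressor would additionally entropy-code these fields and handle indels and unaligned reads.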