MPEG-G Reference-Based Compression of Unaligned Reads Through Ultra-Fast Alignments

Bibliographic Details
Main Authors: Ozturk, U., Casale-Brunet, S., Ribeca, P., Mattavelli, M.
Format: Conference Proceeding
Language: English
Description
Summary: With the widespread application of next-generation sequencing technologies, the volume of sequencing data has become comparable to that of big data domains. Compressing sequencing reads (nucleotide sequences, quality values, read names), in both raw and aligned data, is a way to reduce the bandwidth, transfer, and storage requirements of genomics pipelines. ISO/IEC MPEG-G standardizes the compressed representation (i.e., storage and streaming) of structured, indexed sets of genomic sequencing data, both raw and aligned. For aligned data, reference-based compression encodes the nucleotide sequence of each read using its alignment to a reference sequence: it stores the starting position of the alignment on the reference and the differences between the reference and the actual read. This general scheme is implemented in different ways by genomic data compressors that operate on aligned reads, such as DeeZ, Quip, and CRAM.
ISSN: 2375-0359
DOI: 10.1109/DCC52660.2022.00089
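
The reference-based scheme described in the summary (store the alignment start position plus the differences from the reference) can be illustrated with a minimal Python sketch. This is not the MPEG-G, DeeZ, Quip, or CRAM encoding. The function names `encode_read` and `decode_read` and the tuple layout are hypothetical, and the sketch assumes the read aligns with substitutions only (no insertions, deletions, or clipping).

```python
from typing import List, Tuple

# (offset within the read, base observed in the read); hypothetical layout
Mismatch = Tuple[int, str]


def encode_read(reference: str, read: str, pos: int) -> Tuple[int, int, List[Mismatch]]:
    """Represent a read as (start position, length, substitutions vs. the reference).

    Assumes a gap-free alignment: the read covers reference[pos:pos + len(read)].
    """
    if pos < 0 or pos + len(read) > len(reference):
        raise ValueError("alignment falls outside the reference")
    window = reference[pos:pos + len(read)]
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), mismatches


def decode_read(reference: str, pos: int, length: int, mismatches: List[Mismatch]) -> str:
    """Rebuild the read by copying the reference window and re-applying substitutions."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy example: a 5-base read aligned at position 2 with one substitution.
reference = "ACGTACGTAC"
read = "GTTCG"
encoded = encode_read(reference, read, pos=2)
restored = decode_read(reference, *encoded)
```

When a read matches the reference closely, the mismatch list is short, so the encoded form is much smaller than the read itself; a real compressor would additionally entropy-code these fields and handle indels.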