Rethinking Learning-Based Method for Lossless Genome Compression

Bibliographic Details
Main Authors: Yang, Han; Gu, Fei; Ye, Jieping
Format: Conference Proceeding
Language: English
Description
Summary: Lossless genome compression plays a vital role in genomic analysis. The main challenges stem from the long range of genome sequences and the high frequency of genome variants. However, almost all existing learning-based methods rely on local DNA fragments in small windows, which leaves them unable to capture deep regularities of the genome sequence and may lead to unsatisfactory performance because of the large variability between individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce the global position within the complete genome sequence into our deep model, which makes the genome sequence distinguishable at the base level. Second, we pre-train our deep model by identifying SNP (single-nucleotide polymorphism) genome variants, which further facilitates the genome compression task. The proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.
ISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10096124