Rethinking Learning-Based Method for Lossless Genome Compression

Bibliographic Details
Main Authors: Yang, Han; Gu, Fei; Ye, Jieping
Format: Conference Proceeding
Language: English
container_end_page 5
container_start_page 1
creator Yang, Han
Gu, Fei
Ye, Jieping
description Lossless genome compression plays a vital role in genomic analysis. The main challenges are the long range of genome sequences and the high frequency of genome variants. However, almost all existing learning-based methods rely on local DNA fragments in small windows, which makes them unable to capture deep regularities of the genome sequence and may lead to unsatisfactory performance because of the large variability between individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce the global position of the complete genome sequence into our deep model, which makes the genome sequence distinguishable at the base level. Second, we pre-train our deep model by identifying SNP genome variants, which further facilitates the genome compression task. Furthermore, the proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.
doi_str_mv 10.1109/ICASSP49357.2023.10096124
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10096124</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10096124</ieee_id><sourcerecordid>10096124</sourcerecordid><originalsourceid>FETCH-LOGICAL-i1704-e0963af2a3a96062a880e3be4c6a7c173bd7674c821443244a69b9cb3c357a2c3</originalsourceid><addsrcrecordid>eNo1j91KxDAUhKMguLv6Bl7EB-h6khPzc6cWXYWK4ip4t6TpqUa3zdJ449sbUK9mhoHhG8ZOBSyFAHd2V1-u14_K4blZSpC4FABOC6n22FwYaYVGacw-m0k0rhIOXg_ZPOcPALBG2Rm7eKKv9zh-xvGNN-SnsZjqymfq-H1pUsf7NPEm5bylnPmKxjQQr9Owm0qOaTxiB73fZjr-0wV7ubl-rm-r5mFV6JoqCgOqooKFvpcevdOgpbcWCFtSQXsThMG2M9qoYKVQCqVSXrvWhRZDueZlwAU7-d2NRLTZTXHw0_fm_y7-AFFkSkY</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Rethinking Learning-Based Method for Lossless Genome Compression</title><source>IEEE Xplore All Conference Series</source><creator>Yang, Han ; Gu, Fei ; Ye, Jieping</creator><creatorcontrib>Yang, Han ; Gu, Fei ; Ye, Jieping</creatorcontrib><description>Lossless genome compression plays a vital role in genomic analysis procedure. The main challenges are from long range of the genome sequence and high frequency of genome variants. However, existing learning-based methods almost all rely on local DNA fragments in small windows, which makes them unable to capture deep regularities of genome sequence and may lead to unsatisfactory performance because of the large variability in different individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce global position of the complete genome sequence into our deep model, which can make the genome sequence distinguishable in base level. 
Second, we pre-train our deep model by identifying SNP genome variants, which can further facilitate genome compression task. Furthermore, the proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.</description><identifier>EISSN: 2379-190X</identifier><identifier>EISBN: 1728163277</identifier><identifier>EISBN: 9781728163277</identifier><identifier>DOI: 10.1109/ICASSP49357.2023.10096124</identifier><language>eng</language><publisher>IEEE</publisher><subject>Deep learning ; DNA ; genome variants ; Genomics ; High frequency ; Learning systems ; lossless genome compression ; Signal processing ; transformer ; Transformer cores ; Transformers</subject><ispartof>ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p.1-5</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10096124$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,4050,4051,23930,23931,25140,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10096124$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Yang, Han</creatorcontrib><creatorcontrib>Gu, Fei</creatorcontrib><creatorcontrib>Ye, Jieping</creatorcontrib><title>Rethinking Learning-Based Method for Lossless Genome Compression</title><title>ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title><addtitle>ICASSP</addtitle><description>Lossless genome compression plays a vital role in genomic analysis procedure. 
The main challenges are from long range of the genome sequence and high frequency of genome variants. However, existing learning-based methods almost all rely on local DNA fragments in small windows, which makes them unable to capture deep regularities of genome sequence and may lead to unsatisfactory performance because of the large variability in different individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce global position of the complete genome sequence into our deep model, which can make the genome sequence distinguishable in base level. Second, we pre-train our deep model by identifying SNP genome variants, which can further facilitate genome compression task. Furthermore, the proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.</description><subject>Deep learning</subject><subject>DNA</subject><subject>genome variants</subject><subject>Genomics</subject><subject>High frequency</subject><subject>Learning systems</subject><subject>lossless genome compression</subject><subject>Signal processing</subject><subject>transformer</subject><subject>Transformer 
cores</subject><subject>Transformers</subject><issn>2379-190X</issn><isbn>1728163277</isbn><isbn>9781728163277</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2023</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo1j91KxDAUhKMguLv6Bl7EB-h6khPzc6cWXYWK4ip4t6TpqUa3zdJ449sbUK9mhoHhG8ZOBSyFAHd2V1-u14_K4blZSpC4FABOC6n22FwYaYVGacw-m0k0rhIOXg_ZPOcPALBG2Rm7eKKv9zh-xvGNN-SnsZjqymfq-H1pUsf7NPEm5bylnPmKxjQQr9Owm0qOaTxiB73fZjr-0wV7ubl-rm-r5mFV6JoqCgOqooKFvpcevdOgpbcWCFtSQXsThMG2M9qoYKVQCqVSXrvWhRZDueZlwAU7-d2NRLTZTXHw0_fm_y7-AFFkSkY</recordid><startdate>2023</startdate><enddate>2023</enddate><creator>Yang, Han</creator><creator>Gu, Fei</creator><creator>Ye, Jieping</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>2023</creationdate><title>Rethinking Learning-Based Method for Lossless Genome Compression</title><author>Yang, Han ; Gu, Fei ; Ye, Jieping</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i1704-e0963af2a3a96062a880e3be4c6a7c173bd7674c821443244a69b9cb3c357a2c3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Deep learning</topic><topic>DNA</topic><topic>genome variants</topic><topic>Genomics</topic><topic>High frequency</topic><topic>Learning systems</topic><topic>lossless genome compression</topic><topic>Signal processing</topic><topic>transformer</topic><topic>Transformer cores</topic><topic>Transformers</topic><toplevel>online_resources</toplevel><creatorcontrib>Yang, Han</creatorcontrib><creatorcontrib>Gu, Fei</creatorcontrib><creatorcontrib>Ye, Jieping</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by 
volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE/IET Electronic Library</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Yang, Han</au><au>Gu, Fei</au><au>Ye, Jieping</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Rethinking Learning-Based Method for Lossless Genome Compression</atitle><btitle>ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</btitle><stitle>ICASSP</stitle><date>2023</date><risdate>2023</risdate><spage>1</spage><epage>5</epage><pages>1-5</pages><eissn>2379-190X</eissn><eisbn>1728163277</eisbn><eisbn>9781728163277</eisbn><abstract>Lossless genome compression plays a vital role in genomic analysis procedure. The main challenges are from long range of the genome sequence and high frequency of genome variants. However, existing learning-based methods almost all rely on local DNA fragments in small windows, which makes them unable to capture deep regularities of genome sequence and may lead to unsatisfactory performance because of the large variability in different individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce global position of the complete genome sequence into our deep model, which can make the genome sequence distinguishable in base level. Second, we pre-train our deep model by identifying SNP genome variants, which can further facilitate genome compression task. Furthermore, the proposed CompressBERT is validated on three datasets from different species. 
Experimental results show that our approach outperforms state-of-the-art methods.</abstract><pub>IEEE</pub><doi>10.1109/ICASSP49357.2023.10096124</doi><tpages>5</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier EISSN: 2379-190X
ispartof ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p.1-5
issn 2379-190X
language eng
recordid cdi_ieee_primary_10096124
source IEEE Xplore All Conference Series
subjects Deep learning
DNA
genome variants
Genomics
High frequency
Learning systems
lossless genome compression
Signal processing
transformer
Transformer cores
Transformers
title Rethinking Learning-Based Method for Lossless Genome Compression
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T22%3A14%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Rethinking%20Learning-Based%20Method%20for%20Lossless%20Genome%20Compression&rft.btitle=ICASSP%202023%20-%202023%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Yang,%20Han&rft.date=2023&rft.spage=1&rft.epage=5&rft.pages=1-5&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP49357.2023.10096124&rft.eisbn=1728163277&rft.eisbn_list=9781728163277&rft_dat=%3Cieee_CHZPO%3E10096124%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i1704-e0963af2a3a96062a880e3be4c6a7c173bd7674c821443244a69b9cb3c357a2c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10096124&rfr_iscdi=true