Rethinking Learning-Based Method for Lossless Genome Compression
Lossless genome compression plays a vital role in genomic analysis. The main challenges are the long range of the genome sequence and the high frequency of genome variants. However, almost all existing learning-based methods rely on local DNA fragments in small windows, which makes them unabl...
Main Authors: | Yang, Han; Gu, Fei; Ye, Jieping |
---|---|
Format: | Conference Proceeding |
Language: | English |
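Subjects: | Deep learning; DNA; genome variants; Genomics; High frequency; Learning systems; lossless genome compression; Signal processing; transformer; Transformer cores; Transformers |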
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 5 |
container_issue | |
container_start_page | 1 |
container_title | |
container_volume | |
creator | Yang, Han; Gu, Fei; Ye, Jieping |
description | Lossless genome compression plays a vital role in genomic analysis. The main challenges are the long range of the genome sequence and the high frequency of genome variants. However, almost all existing learning-based methods rely on local DNA fragments in small windows, which makes them unable to capture deep regularities of the genome sequence and may lead to unsatisfactory performance because of the large variability between individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce the global position of the complete genome sequence into our deep model, which makes the genome sequence distinguishable at the base level. Second, we pre-train our deep model by identifying SNP genome variants, which further facilitates the genome compression task. Furthermore, the proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods. |
doi_str_mv | 10.1109/ICASSP49357.2023.10096124 |
format | conference_proceeding |
publisher | IEEE |
eisbn | 1728163277; 9781728163277 |
oa | free_for_read |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2379-190X |
ispartof | ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p.1-5 |
issn | 2379-190X |
language | eng |
recordid | cdi_ieee_primary_10096124 |
source | IEEE Xplore All Conference Series |
subjects | Deep learning; DNA; genome variants; Genomics; High frequency; Learning systems; lossless genome compression; Signal processing; transformer; Transformer cores; Transformers |
title | Rethinking Learning-Based Method for Lossless Genome Compression |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T22%3A14%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Rethinking%20Learning-Based%20Method%20for%20Lossless%20Genome%20Compression&rft.btitle=ICASSP%202023%20-%202023%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20(ICASSP)&rft.au=Yang,%20Han&rft.date=2023&rft.spage=1&rft.epage=5&rft.pages=1-5&rft.eissn=2379-190X&rft_id=info:doi/10.1109/ICASSP49357.2023.10096124&rft.eisbn=1728163277&rft.eisbn_list=9781728163277&rft_dat=%3Cieee_CHZPO%3E10096124%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i1704-e0963af2a3a96062a880e3be4c6a7c173bd7674c821443244a69b9cb3c357a2c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10096124&rfr_iscdi=true |
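
The description above says the model uses the global position of each base in the complete genome rather than its position inside a small window. The abstract does not say how positions are encoded, so the following is only a rough sketch of the idea under stated assumptions: it uses the standard sinusoidal encoding as a stand-in, and the names `sinusoidal_encoding` and `window_positions` are hypothetical, not taken from the paper.

```python
import math


def sinusoidal_encoding(position: int, dim: int = 8) -> list[float]:
    """Standard sinusoidal encoding of one integer position (dim must be even).

    Assumption: used here only as a stand-in; the record does not state
    which positional encoding CompressBERT uses.
    """
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def window_positions(start: int, length: int, use_global: bool) -> list[int]:
    """Position indices for a window that begins at genome offset `start`.

    With use_global=False every window is numbered from 0, so two windows
    holding the same bases receive identical position indices.
    """
    offset = start if use_global else 0
    return [offset + k for k in range(length)]


# Two windows of equal length taken from different places in the genome.
local_first = window_positions(start=0, length=4, use_global=False)
local_second = window_positions(start=1000, length=4, use_global=False)
global_first = window_positions(start=0, length=4, use_global=True)
global_second = window_positions(start=1000, length=4, use_global=True)

# Encodings a model would receive alongside the base embeddings.
global_second_enc = [sinusoidal_encoding(p) for p in global_second]
```

With window-local numbering the two windows are indistinguishable by position, while global numbering gives each base a distinct index, which is one way to read the abstract's "distinguishable in base level".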