Compressing DNA sequence databases with coil

Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then c...

Bibliographic Details
Published in: BMC bioinformatics 2008-05, Vol.9 (1), p.242-242, Article 242
Main Authors: White, W Timothy J; Hendy, Michael D
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3
cites cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3
container_end_page 242
container_issue 1
container_start_page 242
container_title BMC bioinformatics
container_volume 9
creator White, W Timothy J
Hendy, Michael D
description Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
doi_str_mv 10.1186/1471-2105-9-242
format article
fullrecord <record><control><sourceid>gale_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A179991602</galeid><doaj_id>oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b</doaj_id><sourcerecordid>A179991602</sourcerecordid><originalsourceid>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</originalsourceid><addsrcrecordid>eNqFkluL1DAYhoMo7kGvvZOCICzY3aTN8UYYx9PAouDhOiRfM90sbTMmHQ__3tQO6xZWJBcJX548Sd4EoScEnxMi-QWhgpQVwaxUZUWre-j4pnL_1vgInaR0jTERErOH6IhIKpVQ9Bi9WId-F11KfmiL1x9WRXLf9m4AVzRmNNYkl4offrwqIPjuEXqwNV1yjw_9Kfr69s2X9fvy8uO7zXp1WVpO5VhSrMDWkisHkjEpKANKLKlAOlUJU1dguMOqpkYRUPkoXDYKOMdGqgow1KdoM3ubYK71LvrexF86GK__FEJstYmjh87pLDENw1YJANpwa6wVmAHmIATjwmbXy9m129veNeCGMZpuIV3ODP5Kt-G7znFygUUWvJoF1od_CJYzEHo9Ja-n5LWaRFny_HCKGHK-adS9T-C6zgwu7JMWhPOcRP1fkCiuWL5aBp_NYGtyDH7Yhrw5TLBeEaGUIhxP-57fQeXWuN5DGNzW5_piwdliQWZG93NszT4lvfn8aclezCzEkFJ025tQCNbT97wjhqe3H-Mvf_iP9W-ElNx2</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>19695775</pqid></control><display><type>article</type><title>Compressing DNA sequence databases with coil</title><source>PubMed Central (Open access)</source><creator>White, W Timothy J ; Hendy, Michael D</creator><creatorcontrib>White, W Timothy J ; Hendy, Michael D</creatorcontrib><description>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. 
While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. 
Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</description><identifier>ISSN: 1471-2105</identifier><identifier>EISSN: 1471-2105</identifier><identifier>DOI: 10.1186/1471-2105-9-242</identifier><identifier>PMID: 18489794</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>Analysis ; Animals ; Data compression ; Data Compression - methods ; Database Management Systems ; Databases, Nucleic Acid ; Evolution, Molecular ; Expressed Sequence Tags ; Humans ; Methods ; Neural Networks (Computer) ; Nucleotide sequence ; Phylogeny ; Physiological aspects ; Point Mutation ; Sequence Analysis, DNA ; Software ; Species Specificity</subject><ispartof>BMC bioinformatics, 2008-05, Vol.9 (1), p.242-242, Article 242</ispartof><rights>COPYRIGHT 2008 BioMed Central Ltd.</rights><rights>Copyright © 2008 White and Hendy; licensee BioMed Central Ltd. 2008 White and Hendy; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</citedby><cites>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,27923,27924,53790,53792</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/18489794$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>White, W Timothy 
J</creatorcontrib><creatorcontrib>Hendy, Michael D</creatorcontrib><title>Compressing DNA sequence databases with coil</title><title>BMC bioinformatics</title><addtitle>BMC Bioinformatics</addtitle><description>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. 
Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</description><subject>Analysis</subject><subject>Animals</subject><subject>Data compression</subject><subject>Data Compression - methods</subject><subject>Database Management Systems</subject><subject>Databases, Nucleic Acid</subject><subject>Evolution, Molecular</subject><subject>Expressed Sequence Tags</subject><subject>Humans</subject><subject>Methods</subject><subject>Neural Networks (Computer)</subject><subject>Nucleotide sequence</subject><subject>Phylogeny</subject><subject>Physiological aspects</subject><subject>Point Mutation</subject><subject>Sequence Analysis, DNA</subject><subject>Software</subject><subject>Species Specificity</subject><issn>1471-2105</issn><issn>1471-2105</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2008</creationdate><recordtype>article</recordtype><sourceid>DOA</sourceid><recordid>eNqFkluL1DAYhoMo7kGvvZOCICzY3aTN8UYYx9PAouDhOiRfM90sbTMmHQ__3tQO6xZWJBcJX548Sd4EoScEnxMi-QWhgpQVwaxUZUWre-j4pnL_1vgInaR0jTERErOH6IhIKpVQ9Bi9WId-F11KfmiL1x9WRXLf9m4AVzRmNNYkl4offrwqIPjuEXqwNV1yjw_9Kfr69s2X9fvy8uO7zXp1WVpO5VhSrMDWkisHkjEpKANKLKlAOlUJU1dguMOqpkYRUPkoXDYKOMdGqgow1KdoM3ubYK71LvrexF86GK__FEJstYmjh87pLDENw1YJANpwa6wVmAHmIATjwmbXy9m129veNeCGMZpuIV3ODP5Kt-G7znFygUUWvJoF1od_CJYzEHo9Ja-n5LWaRFny_HCKGHK-adS9T-C6zgwu7JMWhPOcRP1fkCiuWL5aBp_NYGtyDH7Yhrw5TLBeEaGUIhxP-57fQeXWuN5DGNzW5_piwdliQWZG93NszT4lvfn8aclezCzEkFJ025tQCNbT97wjhqe3H-Mvf_iP9W-ElNx2</recordid><startdate>20080520</startdate><enddate>20080520</enddate><creator>White, W Timothy J</creator><creator>Hendy, Michael D</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><general>BMC</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>7QO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope></search><sort><creationdate>20080520</creationdate><title>Compressing DNA sequence databases with coil</title><author>White, W Timothy J ; Hendy, Michael D</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2008</creationdate><topic>Analysis</topic><topic>Animals</topic><topic>Data compression</topic><topic>Data Compression - methods</topic><topic>Database Management Systems</topic><topic>Databases, Nucleic Acid</topic><topic>Evolution, Molecular</topic><topic>Expressed Sequence Tags</topic><topic>Humans</topic><topic>Methods</topic><topic>Neural Networks (Computer)</topic><topic>Nucleotide sequence</topic><topic>Phylogeny</topic><topic>Physiological aspects</topic><topic>Point Mutation</topic><topic>Sequence Analysis, DNA</topic><topic>Software</topic><topic>Species Specificity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>White, W Timothy J</creatorcontrib><creatorcontrib>Hendy, Michael D</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>Biotechnology Research Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research 
Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>BMC bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>White, W Timothy J</au><au>Hendy, Michael D</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Compressing DNA sequence databases with coil</atitle><jtitle>BMC bioinformatics</jtitle><addtitle>BMC Bioinformatics</addtitle><date>2008-05-20</date><risdate>2008</risdate><volume>9</volume><issue>1</issue><spage>242</spage><epage>242</epage><pages>242-242</pages><artnum>242</artnum><issn>1471-2105</issn><eissn>1471-2105</eissn><abstract>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. 
Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>18489794</pmid><doi>10.1186/1471-2105-9-242</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1471-2105
ispartof BMC bioinformatics, 2008-05, Vol.9 (1), p.242-242, Article 242
issn 1471-2105
1471-2105
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b
source PubMed Central (Open access)
subjects Analysis
Animals
Data compression
Data Compression - methods
Database Management Systems
Databases, Nucleic Acid
Evolution, Molecular
Expressed Sequence Tags
Humans
Methods
Neural Networks (Computer)
Nucleotide sequence
Phylogeny
Physiological aspects
Point Mutation
Sequence Analysis, DNA
Software
Species Specificity
title Compressing DNA sequence databases with coil
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T18%3A39%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Compressing%20DNA%20sequence%20databases%20with%20coil&rft.jtitle=BMC%20bioinformatics&rft.au=White,%20W%20Timothy%20J&rft.date=2008-05-20&rft.volume=9&rft.issue=1&rft.spage=242&rft.epage=242&rft.pages=242-242&rft.artnum=242&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/1471-2105-9-242&rft_dat=%3Cgale_doaj_%3EA179991602%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=19695775&rft_id=info:pmid/18489794&rft_galeid=A179991602&rfr_iscdi=true