Compressing DNA sequence databases with coil
Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then c...
Published in: | BMC bioinformatics 2008-05, Vol.9 (1), p.242-242, Article 242 |
---|---|
Main Authors: | White, W Timothy J ; Hendy, Michael D |
Format: | Article |
Language: | English |
cited_by | cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3 |
---|---|
cites | cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3 |
container_end_page | 242 |
container_issue | 1 |
container_start_page | 242 |
container_title | BMC bioinformatics |
container_volume | 9 |
creator | White, W Timothy J ; Hendy, Michael D |
description | Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.
We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database.
coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work. |
doi_str_mv | 10.1186/1471-2105-9-242 |
format | article |
fullrecord | <record><control><sourceid>gale_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A179991602</galeid><doaj_id>oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b</doaj_id><sourcerecordid>A179991602</sourcerecordid><originalsourceid>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</originalsourceid><addsrcrecordid>eNqFkluL1DAYhoMo7kGvvZOCICzY3aTN8UYYx9PAouDhOiRfM90sbTMmHQ__3tQO6xZWJBcJX548Sd4EoScEnxMi-QWhgpQVwaxUZUWre-j4pnL_1vgInaR0jTERErOH6IhIKpVQ9Bi9WId-F11KfmiL1x9WRXLf9m4AVzRmNNYkl4offrwqIPjuEXqwNV1yjw_9Kfr69s2X9fvy8uO7zXp1WVpO5VhSrMDWkisHkjEpKANKLKlAOlUJU1dguMOqpkYRUPkoXDYKOMdGqgow1KdoM3ubYK71LvrexF86GK__FEJstYmjh87pLDENw1YJANpwa6wVmAHmIATjwmbXy9m129veNeCGMZpuIV3ODP5Kt-G7znFygUUWvJoF1od_CJYzEHo9Ja-n5LWaRFny_HCKGHK-adS9T-C6zgwu7JMWhPOcRP1fkCiuWL5aBp_NYGtyDH7Yhrw5TLBeEaGUIhxP-57fQeXWuN5DGNzW5_piwdliQWZG93NszT4lvfn8aclezCzEkFJ025tQCNbT97wjhqe3H-Mvf_iP9W-ElNx2</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>19695775</pqid></control><display><type>article</type><title>Compressing DNA sequence databases with coil</title><source>PubMed Central (Open access)</source><creator>White, W Timothy J ; Hendy, Michael D</creator><creatorcontrib>White, W Timothy J ; Hendy, Michael D</creatorcontrib><description>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. 
While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.
We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database.
coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</description><identifier>ISSN: 1471-2105</identifier><identifier>EISSN: 1471-2105</identifier><identifier>DOI: 10.1186/1471-2105-9-242</identifier><identifier>PMID: 18489794</identifier><language>eng</language><publisher>England: BioMed Central Ltd</publisher><subject>Analysis ; Animals ; Data compression ; Data Compression - methods ; Database Management Systems ; Databases, Nucleic Acid ; Evolution, Molecular ; Expressed Sequence Tags ; Humans ; Methods ; Neural Networks (Computer) ; Nucleotide sequence ; Phylogeny ; Physiological aspects ; Point Mutation ; Sequence Analysis, DNA ; Software ; Species Specificity</subject><ispartof>BMC bioinformatics, 2008-05, Vol.9 (1), p.242-242, Article 242</ispartof><rights>COPYRIGHT 2008 BioMed Central Ltd.</rights><rights>Copyright © 2008 White and Hendy; licensee BioMed Central Ltd. 
2008 White and Hendy; licensee BioMed Central Ltd.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</citedby><cites>FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC2426707/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,27923,27924,53790,53792</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/18489794$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>White, W Timothy J</creatorcontrib><creatorcontrib>Hendy, Michael D</creatorcontrib><title>Compressing DNA sequence databases with coil</title><title>BMC bioinformatics</title><addtitle>BMC Bioinformatics</addtitle><description>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.
We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database.
coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</description><subject>Analysis</subject><subject>Animals</subject><subject>Data compression</subject><subject>Data Compression - methods</subject><subject>Database Management Systems</subject><subject>Databases, Nucleic Acid</subject><subject>Evolution, Molecular</subject><subject>Expressed Sequence Tags</subject><subject>Humans</subject><subject>Methods</subject><subject>Neural Networks (Computer)</subject><subject>Nucleotide sequence</subject><subject>Phylogeny</subject><subject>Physiological aspects</subject><subject>Point Mutation</subject><subject>Sequence Analysis, DNA</subject><subject>Software</subject><subject>Species Specificity</subject><issn>1471-2105</issn><issn>1471-2105</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2008</creationdate><recordtype>article</recordtype><sourceid>DOA</sourceid><recordid>eNqFkluL1DAYhoMo7kGvvZOCICzY3aTN8UYYx9PAouDhOiRfM90sbTMmHQ__3tQO6xZWJBcJX548Sd4EoScEnxMi-QWhgpQVwaxUZUWre-j4pnL_1vgInaR0jTERErOH6IhIKpVQ9Bi9WId-F11KfmiL1x9WRXLf9m4AVzRmNNYkl4offrwqIPjuEXqwNV1yjw_9Kfr69s2X9fvy8uO7zXp1WVpO5VhSrMDWkisHkjEpKANKLKlAOlUJU1dguMOqpkYRUPkoXDYKOMdGqgow1KdoM3ubYK71LvrexF86GK__FEJstYmjh87pLDENw1YJANpwa6wVmAHmIATjwmbXy9m129veNeCGMZpuIV3ODP5Kt-G7znFygUUWvJoF1od_CJYzEHo9Ja-n5LWaRFny_HCKGHK-adS9T-C6zgwu7JMWhPOcRP1fkCiuWL5aBp_NYGtyDH7Yhrw5TLBeEaGUIhxP-57fQeXWuN5DGNzW5_piwdliQWZG93NszT4lvfn8aclezCzEkFJ025tQCNbT97wjhqe3H-Mvf_iP9W-ElNx2</recordid><startdate>20080520</startdate><enddate>20080520</enddate><creator>White, W Timothy J</creator><creator>Hendy, Michael D</creator><general>BioMed Central Ltd</general><general>BioMed 
Central</general><general>BMC</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISR</scope><scope>7QO</scope><scope>7TM</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope></search><sort><creationdate>20080520</creationdate><title>Compressing DNA sequence databases with coil</title><author>White, W Timothy J ; Hendy, Michael D</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2008</creationdate><topic>Analysis</topic><topic>Animals</topic><topic>Data compression</topic><topic>Data Compression - methods</topic><topic>Database Management Systems</topic><topic>Databases, Nucleic Acid</topic><topic>Evolution, Molecular</topic><topic>Expressed Sequence Tags</topic><topic>Humans</topic><topic>Methods</topic><topic>Neural Networks (Computer)</topic><topic>Nucleotide sequence</topic><topic>Phylogeny</topic><topic>Physiological aspects</topic><topic>Point Mutation</topic><topic>Sequence Analysis, DNA</topic><topic>Software</topic><topic>Species Specificity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>White, W Timothy J</creatorcontrib><creatorcontrib>Hendy, Michael D</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Science</collection><collection>Biotechnology Research Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research 
Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>BMC bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>White, W Timothy J</au><au>Hendy, Michael D</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Compressing DNA sequence databases with coil</atitle><jtitle>BMC bioinformatics</jtitle><addtitle>BMC Bioinformatics</addtitle><date>2008-05-20</date><risdate>2008</risdate><volume>9</volume><issue>1</issue><spage>242</spage><epage>242</epage><pages>242-242</pages><artnum>242</artnum><issn>1471-2105</issn><eissn>1471-2105</eissn><abstract>Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.
We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database.
coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.</abstract><cop>England</cop><pub>BioMed Central Ltd</pub><pmid>18489794</pmid><doi>10.1186/1471-2105-9-242</doi><tpages>1</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1471-2105 |
ispartof | BMC bioinformatics, 2008-05, Vol.9 (1), p.242-242, Article 242 |
issn | 1471-2105 1471-2105 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_34aad50b97cc4d6babb705c06c77567b |
source | PubMed Central (Open access) |
subjects | Analysis ; Animals ; Data compression ; Data Compression - methods ; Database Management Systems ; Databases, Nucleic Acid ; Evolution, Molecular ; Expressed Sequence Tags ; Humans ; Methods ; Neural Networks (Computer) ; Nucleotide sequence ; Phylogeny ; Physiological aspects ; Point Mutation ; Sequence Analysis, DNA ; Software ; Species Specificity |
title | Compressing DNA sequence databases with coil |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T18%3A39%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Compressing%20DNA%20sequence%20databases%20with%20coil&rft.jtitle=BMC%20bioinformatics&rft.au=White,%20W%20Timothy%20J&rft.date=2008-05-20&rft.volume=9&rft.issue=1&rft.spage=242&rft.epage=242&rft.pages=242-242&rft.artnum=242&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/1471-2105-9-242&rft_dat=%3Cgale_doaj_%3EA179991602%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-b648t-409cb3869ec8558745c41b12c8e927a32ca6e0934a91c984868d9c660a892c0c3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=19695775&rft_id=info:pmid/18489794&rft_galeid=A179991602&rfr_iscdi=true |