CIndex: compressed indexes for fast retrieval of FASTQ files

Abstract. Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance....

Bibliographic Details
Published in: Bioinformatics 2022-01, Vol.38 (2), p.335-343
Main Authors: Huo, Hongwei; Liu, Pengfei; Wang, Chenhui; Jiang, Hongbo; Vitter, Jeffrey Scott
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3
cites cdi_FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3
container_end_page 343
container_issue 2
container_start_page 335
container_title Bioinformatics
container_volume 38
creator Huo, Hongwei
Liu, Pengfei
Wang, Chenhui
Jiang, Hongbo
Vitter, Jeffrey Scott
description Abstract. Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows–Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7–41.66 percentage points less space and provides speedups of 70–167.16 times, 1.44–35.57 times and 1.3–55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86–14.88 percentage points less space and provides a speedup of 3.13–20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. Availability and implementation: The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex. Supplementary information: Supplementary data are available at Bioinformatics online.
doi_str_mv 10.1093/bioinformatics/btab655
format article
fullrecord <record><control><sourceid>proquest_TOX</sourceid><recordid>TN_cdi_proquest_miscellaneous_2572934266</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btab655</oup_id><sourcerecordid>2572934266</sourcerecordid><originalsourceid>FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3</originalsourceid><addsrcrecordid>eNqNkE1Lw0AQhhdRbK3-hbJHL7E7-9VEvJRitVAQsZ7DJpmFlaQbdxPRf29Kq-DN0wzDM-8MDyFTYDfAMjErnHc760NjOlfGWdGZQit1QsYgNUs4U9np0As9T2TKxIhcxPjGmAIp5TkZCam4lKDH5G653lX4eUtL37QBY8SKuv0EIx3iqTWxowG74PDD1NRbulq8bJ-pdTXGS3JmTR3x6lgn5HV1v10-Jpunh_VysUlKoUSXWDCVkkrbFEAVWVqkOuWlRKtBgwE9t0akFbc841oJxlDCHAplrBjeLS2KCbk-5LbBv_cYu7xxscS6Njv0fcy5mvNMSK71gOoDWgYfY0Cbt8E1JnzlwPK9ufyvufxoblicHm_0RYPV79qPqgGAA-D79r-h33Fxf8g</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2572934266</pqid></control><display><type>article</type><title>CIndex: compressed indexes for fast retrieval of FASTQ files</title><source>Oxford Open Access Journals</source><creator>Huo, Hongwei ; Liu, Pengfei ; Wang, Chenhui ; Jiang, Hongbo ; Vitter, Jeffrey Scott</creator><contributor>Wren, Jonathan</contributor><creatorcontrib>Huo, Hongwei ; Liu, Pengfei ; Wang, Chenhui ; Jiang, Hongbo ; Vitter, Jeffrey Scott ; Wren, Jonathan</creatorcontrib><description>Abstract Motivation Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. Results We propose a compressed index for FASTQ files called CIndex. 
CIndex uses the Burrows–Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7–41.66% points less space and provides a speedup of 70–167.16 times, 1.44–35.57 times and 1.3–55.4 times. For extracting records in FASTQ files, our method uses 2.86–14.88% points less space and provides a speedup of 3.13–20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. Availability and implementation The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. Supplementary information Supplementary data are available at Bioinformatics online.</description><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1460-2059</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btab655</identifier><identifier>PMID: 34524416</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Algorithms ; Data Compression - methods ; Genome ; Genomics - methods ; High-Throughput Nucleotide Sequencing - methods ; Sequence Analysis, DNA - methods ; Software</subject><ispartof>Bioinformatics, 2022-01, Vol.38 (2), p.335-343</ispartof><rights>The Author(s) 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 2021</rights><rights>The Author(s) 2021. Published by Oxford University Press. All rights reserved. 
For permissions, please e-mail: journals.permissions@oup.com.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3</citedby><cites>FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3</cites><orcidid>0000-0001-7970-6118 ; 0000-0002-5436-1851</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,1598,27901,27902</link.rule.ids><linktorsrc>$$Uhttps://dx.doi.org/10.1093/bioinformatics/btab655$$EView_record_in_Oxford_University_Press$$FView_record_in_$$GOxford_University_Press</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/34524416$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Wren, Jonathan</contributor><creatorcontrib>Huo, Hongwei</creatorcontrib><creatorcontrib>Liu, Pengfei</creatorcontrib><creatorcontrib>Wang, Chenhui</creatorcontrib><creatorcontrib>Jiang, Hongbo</creatorcontrib><creatorcontrib>Vitter, Jeffrey Scott</creatorcontrib><title>CIndex: compressed indexes for fast retrieval of FASTQ files</title><title>Bioinformatics</title><addtitle>Bioinformatics</addtitle><description>Abstract Motivation Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. Results We propose a compressed index for FASTQ files called CIndex. 
CIndex uses the Burrows–Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7–41.66% points less space and provides a speedup of 70–167.16 times, 1.44–35.57 times and 1.3–55.4 times. For extracting records in FASTQ files, our method uses 2.86–14.88% points less space and provides a speedup of 3.13–20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. Availability and implementation The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. Supplementary information Supplementary data are available at Bioinformatics online.</description><subject>Algorithms</subject><subject>Data Compression - methods</subject><subject>Genome</subject><subject>Genomics - methods</subject><subject>High-Throughput Nucleotide Sequencing - methods</subject><subject>Sequence Analysis, DNA - 
methods</subject><subject>Software</subject><issn>1367-4803</issn><issn>1460-2059</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNqNkE1Lw0AQhhdRbK3-hbJHL7E7-9VEvJRitVAQsZ7DJpmFlaQbdxPRf29Kq-DN0wzDM-8MDyFTYDfAMjErnHc760NjOlfGWdGZQit1QsYgNUs4U9np0As9T2TKxIhcxPjGmAIp5TkZCam4lKDH5G653lX4eUtL37QBY8SKuv0EIx3iqTWxowG74PDD1NRbulq8bJ-pdTXGS3JmTR3x6lgn5HV1v10-Jpunh_VysUlKoUSXWDCVkkrbFEAVWVqkOuWlRKtBgwE9t0akFbc841oJxlDCHAplrBjeLS2KCbk-5LbBv_cYu7xxscS6Njv0fcy5mvNMSK71gOoDWgYfY0Cbt8E1JnzlwPK9ufyvufxoblicHm_0RYPV79qPqgGAA-D79r-h33Fxf8g</recordid><startdate>20220103</startdate><enddate>20220103</enddate><creator>Huo, Hongwei</creator><creator>Liu, Pengfei</creator><creator>Wang, Chenhui</creator><creator>Jiang, Hongbo</creator><creator>Vitter, Jeffrey Scott</creator><general>Oxford University Press</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-7970-6118</orcidid><orcidid>https://orcid.org/0000-0002-5436-1851</orcidid></search><sort><creationdate>20220103</creationdate><title>CIndex: compressed indexes for fast retrieval of FASTQ files</title><author>Huo, Hongwei ; Liu, Pengfei ; Wang, Chenhui ; Jiang, Hongbo ; Vitter, Jeffrey Scott</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Data Compression - methods</topic><topic>Genome</topic><topic>Genomics - methods</topic><topic>High-Throughput Nucleotide Sequencing - methods</topic><topic>Sequence Analysis, DNA - 
methods</topic><topic>Software</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Huo, Hongwei</creatorcontrib><creatorcontrib>Liu, Pengfei</creatorcontrib><creatorcontrib>Wang, Chenhui</creatorcontrib><creatorcontrib>Jiang, Hongbo</creatorcontrib><creatorcontrib>Vitter, Jeffrey Scott</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Huo, Hongwei</au><au>Liu, Pengfei</au><au>Wang, Chenhui</au><au>Jiang, Hongbo</au><au>Vitter, Jeffrey Scott</au><au>Wren, Jonathan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>CIndex: compressed indexes for fast retrieval of FASTQ files</atitle><jtitle>Bioinformatics</jtitle><addtitle>Bioinformatics</addtitle><date>2022-01-03</date><risdate>2022</risdate><volume>38</volume><issue>2</issue><spage>335</spage><epage>343</epage><pages>335-343</pages><issn>1367-4803</issn><eissn>1460-2059</eissn><eissn>1367-4811</eissn><abstract>Abstract Motivation Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. Results We propose a compressed index for FASTQ files called CIndex. 
CIndex uses the Burrows–Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7–41.66% points less space and provides a speedup of 70–167.16 times, 1.44–35.57 times and 1.3–55.4 times. For extracting records in FASTQ files, our method uses 2.86–14.88% points less space and provides a speedup of 3.13–20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. Availability and implementation The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. Supplementary information Supplementary data are available at Bioinformatics online.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>34524416</pmid><doi>10.1093/bioinformatics/btab655</doi><tpages>9</tpages><orcidid>https://orcid.org/0000-0001-7970-6118</orcidid><orcidid>https://orcid.org/0000-0002-5436-1851</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1367-4803
ispartof Bioinformatics, 2022-01, Vol.38 (2), p.335-343
issn 1367-4803
1460-2059
1367-4811
language eng
recordid cdi_proquest_miscellaneous_2572934266
source Oxford Open Access Journals
subjects Algorithms
Data Compression - methods
Genome
Genomics - methods
High-Throughput Nucleotide Sequencing - methods
Sequence Analysis, DNA - methods
Software
title CIndex: compressed indexes for fast retrieval of FASTQ files
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T22%3A45%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_TOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CIndex:%20compressed%20indexes%20for%20fast%20retrieval%20of%20FASTQ%20files&rft.jtitle=Bioinformatics&rft.au=Huo,%20Hongwei&rft.date=2022-01-03&rft.volume=38&rft.issue=2&rft.spage=335&rft.epage=343&rft.pages=335-343&rft.issn=1367-4803&rft.eissn=1460-2059&rft_id=info:doi/10.1093/bioinformatics/btab655&rft_dat=%3Cproquest_TOX%3E2572934266%3C/proquest_TOX%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c353t-f1ad5456f8115b98b8682c4ef6161a167fa38d2f29265300e4171b5af3144cfe3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2572934266&rft_id=info:pmid/34524416&rft_oup_id=10.1093/bioinformatics/btab655&rfr_iscdi=true
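
The abstract in this record says CIndex builds on the Burrows–Wheeler transform and answers count queries on reads. As a rough illustration of that general idea only, the sketch below builds a BWT by sorting rotations and counts pattern occurrences with backward search. It is not the CIndex implementation: the function names (`bwt`, `build_tables`, `count`) are hypothetical, plain prefix-count tables stand in for the wavelet tree, and none of the hybrid encoding or the REF and Rγ tables from the paper is modelled.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for small inputs).

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text` (true for "$" against A, C, G, T in ASCII).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_tables(bwt_str: str):
    """Build the two tables that backward search needs.

    C[c]      = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    A real compressed index would answer occ with a wavelet tree instead
    of storing these uncompressed arrays.
    """
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count(pattern: str, C, occ, n: int) -> int:
    """Count occurrences of a non-empty pattern using backward search.

    Maintains a half-open interval [lo, hi) of sorted rotations that
    start with the pattern suffix processed so far.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    read = "ACGTACGTAC"  # toy read, not taken from any dataset
    transformed = bwt(read)
    C, occ = build_tables(transformed)
    print(transformed, count("AC", C, occ, len(transformed)))
```

The sorted-rotation construction and the uncompressed `occ` arrays use far more memory than the text itself, so this shows only how a count query works, not how a compressed index saves space.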