Loading…

Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or tw...

Full description

Saved in:
Bibliographic Details
Published in:PNAS nexus 2022-11, Vol.1 (5), p.pgac252-pgac252
Main Author: Press, William H
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
cited_by
cites cdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63
container_end_page pgac252
container_issue 5
container_start_page pgac252
container_title PNAS nexus
container_volume 1
creator Press, William H
description Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
doi_str_mv 10.1093/pnasnexus/pgac252
format article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9802387</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A779046382</galeid><sourcerecordid>A779046382</sourcerecordid><originalsourceid>FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</originalsourceid><addsrcrecordid>eNptUkFvFCEYnRiNbWp_gBdD4sXLtsDAMHMx2VSrTRq96Jl8Cx8jZgZWYJv238u466ZNGg588L334MFrmreMXjA6tJfbADng_S5fbkcwXPIXzSlXkq86KfjLR_VJc57zb0opV4oxIV83J22nGG-VPG0eriEXUpKfMZFcoPhcvMnEgfGTr2skYMwuLYVFE60PI4mOTJBGJAmCjTP59G1NNpBqF0nGkgneYSBQDqiMf3YYzMLElGIii1p-07xyMGU8P8xnzc_rzz-uvq5uv3-5uVrfroxgfVmZDlRfT7a9sFZxGASXVjkpUPK2NcYKZgHFxhhJJTh0m0F2A-N2cB1a7Nqz5uNed7vbzGgNhpJg0ttqGdKDjuD1007wv_QY7_TQU972qgp8OAikWI3komefDU4TBIy7rJdnpb0Yel6h7_fQESbUPrhYFc0C12ulBiq69h_q4hlUHRZnb2JA5-v-EwLbE0yKOSd0x9szqpcw6GMY9CEMlfPuse0j4__Xt38BE_63AQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2771084982</pqid></control><display><type>article</type><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><source>Open Access: Oxford University Press Open Journals</source><source>PubMed Central</source><creator>Press, William H</creator><contributor>Yooseph, Shibu</contributor><creatorcontrib>Press, William H ; Yooseph, Shibu</creatorcontrib><description>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</description><identifier>ISSN: 2752-6542</identifier><identifier>EISSN: 2752-6542</identifier><identifier>DOI: 10.1093/pnasnexus/pgac252</identifier><identifier>PMID: 36712375</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Analysis ; Bar codes ; Biological, Health, and Medical Sciences ; DNA ; DNA sequencing ; Error-correcting codes ; Multiprocessing ; Nucleotide sequencing ; Statistics</subject><ispartof>PNAS nexus, 2022-11, Vol.1 (5), p.pgac252-pgac252</ispartof><rights>The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences.</rights><rights>COPYRIGHT 2022 Oxford University Press</rights><rights>The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences. 2022</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</cites><orcidid>0000-0003-0771-0841</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,27924,27925,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36712375$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Yooseph, Shibu</contributor><creatorcontrib>Press, William H</creatorcontrib><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><title>PNAS nexus</title><addtitle>PNAS Nexus</addtitle><description>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</description><subject>Analysis</subject><subject>Bar codes</subject><subject>Biological, Health, and Medical Sciences</subject><subject>DNA</subject><subject>DNA sequencing</subject><subject>Error-correcting codes</subject><subject>Multiprocessing</subject><subject>Nucleotide sequencing</subject><subject>Statistics</subject><issn>2752-6542</issn><issn>2752-6542</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNptUkFvFCEYnRiNbWp_gBdD4sXLtsDAMHMx2VSrTRq96Jl8Cx8jZgZWYJv238u466ZNGg588L334MFrmreMXjA6tJfbADng_S5fbkcwXPIXzSlXkq86KfjLR_VJc57zb0opV4oxIV83J22nGG-VPG0eriEXUpKfMZFcoPhcvMnEgfGTr2skYMwuLYVFE60PI4mOTJBGJAmCjTP59G1NNpBqF0nGkgneYSBQDqiMf3YYzMLElGIii1p-07xyMGU8P8xnzc_rzz-uvq5uv3-5uVrfroxgfVmZDlRfT7a9sFZxGASXVjkpUPK2NcYKZgHFxhhJJTh0m0F2A-N2cB1a7Nqz5uNed7vbzGgNhpJg0ttqGdKDjuD1007wv_QY7_TQU972qgp8OAikWI3komefDU4TBIy7rJdnpb0Yel6h7_fQESbUPrhYFc0C12ulBiq69h_q4hlUHRZnb2JA5-v-EwLbE0yKOSd0x9szqpcw6GMY9CEMlfPuse0j4__Xt38BE_63AQ</recordid><startdate>20221101</startdate><enddate>20221101</enddate><creator>Press, William H</creator><general>Oxford University Press</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0003-0771-0841</orcidid></search><sort><creationdate>20221101</creationdate><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><author>Press, William H</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Analysis</topic><topic>Bar codes</topic><topic>Biological, Health, and Medical Sciences</topic><topic>DNA</topic><topic>DNA sequencing</topic><topic>Error-correcting codes</topic><topic>Multiprocessing</topic><topic>Nucleotide sequencing</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Press, William H</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>PNAS nexus</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Press, William H</au><au>Yooseph, Shibu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</atitle><jtitle>PNAS nexus</jtitle><addtitle>PNAS Nexus</addtitle><date>2022-11-01</date><risdate>2022</risdate><volume>1</volume><issue>5</issue><spage>pgac252</spage><epage>pgac252</epage><pages>pgac252-pgac252</pages><issn>2752-6542</issn><eissn>2752-6542</eissn><abstract>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>36712375</pmid><doi>10.1093/pnasnexus/pgac252</doi><orcidid>https://orcid.org/0000-0003-0771-0841</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2752-6542
ispartof PNAS nexus, 2022-11, Vol.1 (5), p.pgac252-pgac252
issn 2752-6542
2752-6542
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9802387
source Open Access: Oxford University Press Open Journals; PubMed Central
subjects Analysis
Bar codes
Biological, Health, and Medical Sciences
DNA
DNA sequencing
Error-correcting codes
Multiprocessing
Nucleotide sequencing
Statistics
title Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T22%3A00%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Fast%20trimer%20statistics%20facilitate%20accurate%20decoding%20of%20large%20random%20DNA%20barcode%20sets%20even%20at%20large%20sequencing%20error%20rates&rft.jtitle=PNAS%20nexus&rft.au=Press,%20William%20H&rft.date=2022-11-01&rft.volume=1&rft.issue=5&rft.spage=pgac252&rft.epage=pgac252&rft.pages=pgac252-pgac252&rft.issn=2752-6542&rft.eissn=2752-6542&rft_id=info:doi/10.1093/pnasnexus/pgac252&rft_dat=%3Cgale_pubme%3EA779046382%3C/gale_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2771084982&rft_id=info:pmid/36712375&rft_galeid=A779046382&rfr_iscdi=true