Loading…
Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates
Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or tw...
Saved in:
Published in: | PNAS nexus 2022-11, Vol.1 (5), p.pgac252-pgac252 |
---|---|
Main Author: | |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63 |
container_end_page | pgac252 |
container_issue | 5 |
container_start_page | pgac252 |
container_title | PNAS nexus |
container_volume | 1 |
creator | Press, William H |
description | Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10
barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10
barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10
barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads. |
doi_str_mv | 10.1093/pnasnexus/pgac252 |
format | article |
fullrecord | <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9802387</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A779046382</galeid><sourcerecordid>A779046382</sourcerecordid><originalsourceid>FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</originalsourceid><addsrcrecordid>eNptUkFvFCEYnRiNbWp_gBdD4sXLtsDAMHMx2VSrTRq96Jl8Cx8jZgZWYJv238u466ZNGg588L334MFrmreMXjA6tJfbADng_S5fbkcwXPIXzSlXkq86KfjLR_VJc57zb0opV4oxIV83J22nGG-VPG0eriEXUpKfMZFcoPhcvMnEgfGTr2skYMwuLYVFE60PI4mOTJBGJAmCjTP59G1NNpBqF0nGkgneYSBQDqiMf3YYzMLElGIii1p-07xyMGU8P8xnzc_rzz-uvq5uv3-5uVrfroxgfVmZDlRfT7a9sFZxGASXVjkpUPK2NcYKZgHFxhhJJTh0m0F2A-N2cB1a7Nqz5uNed7vbzGgNhpJg0ttqGdKDjuD1007wv_QY7_TQU972qgp8OAikWI3komefDU4TBIy7rJdnpb0Yel6h7_fQESbUPrhYFc0C12ulBiq69h_q4hlUHRZnb2JA5-v-EwLbE0yKOSd0x9szqpcw6GMY9CEMlfPuse0j4__Xt38BE_63AQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2771084982</pqid></control><display><type>article</type><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><source>Open Access: Oxford University Press Open Journals</source><source>PubMed Central</source><creator>Press, William H</creator><contributor>Yooseph, Shibu</contributor><creatorcontrib>Press, William H ; Yooseph, Shibu</creatorcontrib><description>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10
barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10
barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10
barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</description><identifier>ISSN: 2752-6542</identifier><identifier>EISSN: 2752-6542</identifier><identifier>DOI: 10.1093/pnasnexus/pgac252</identifier><identifier>PMID: 36712375</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Analysis ; Bar codes ; Biological, Health, and Medical Sciences ; DNA ; DNA sequencing ; Error-correcting codes ; Multiprocessing ; Nucleotide sequencing ; Statistics</subject><ispartof>PNAS nexus, 2022-11, Vol.1 (5), p.pgac252-pgac252</ispartof><rights>The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences.</rights><rights>COPYRIGHT 2022 Oxford University Press</rights><rights>The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences. 2022</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</cites><orcidid>0000-0003-0771-0841</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,27924,27925,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36712375$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Yooseph, Shibu</contributor><creatorcontrib>Press, William H</creatorcontrib><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><title>PNAS nexus</title><addtitle>PNAS Nexus</addtitle><description>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10
barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10
barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10
barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</description><subject>Analysis</subject><subject>Bar codes</subject><subject>Biological, Health, and Medical Sciences</subject><subject>DNA</subject><subject>DNA sequencing</subject><subject>Error-correcting codes</subject><subject>Multiprocessing</subject><subject>Nucleotide sequencing</subject><subject>Statistics</subject><issn>2752-6542</issn><issn>2752-6542</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNptUkFvFCEYnRiNbWp_gBdD4sXLtsDAMHMx2VSrTRq96Jl8Cx8jZgZWYJv238u466ZNGg588L334MFrmreMXjA6tJfbADng_S5fbkcwXPIXzSlXkq86KfjLR_VJc57zb0opV4oxIV83J22nGG-VPG0eriEXUpKfMZFcoPhcvMnEgfGTr2skYMwuLYVFE60PI4mOTJBGJAmCjTP59G1NNpBqF0nGkgneYSBQDqiMf3YYzMLElGIii1p-07xyMGU8P8xnzc_rzz-uvq5uv3-5uVrfroxgfVmZDlRfT7a9sFZxGASXVjkpUPK2NcYKZgHFxhhJJTh0m0F2A-N2cB1a7Nqz5uNed7vbzGgNhpJg0ttqGdKDjuD1007wv_QY7_TQU972qgp8OAikWI3komefDU4TBIy7rJdnpb0Yel6h7_fQESbUPrhYFc0C12ulBiq69h_q4hlUHRZnb2JA5-v-EwLbE0yKOSd0x9szqpcw6GMY9CEMlfPuse0j4__Xt38BE_63AQ</recordid><startdate>20221101</startdate><enddate>20221101</enddate><creator>Press, William H</creator><general>Oxford University Press</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0003-0771-0841</orcidid></search><sort><creationdate>20221101</creationdate><title>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</title><author>Press, William H</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Analysis</topic><topic>Bar codes</topic><topic>Biological, Health, and Medical Sciences</topic><topic>DNA</topic><topic>DNA sequencing</topic><topic>Error-correcting codes</topic><topic>Multiprocessing</topic><topic>Nucleotide sequencing</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Press, William H</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>PNAS nexus</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Press, William H</au><au>Yooseph, Shibu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates</atitle><jtitle>PNAS nexus</jtitle><addtitle>PNAS Nexus</addtitle><date>2022-11-01</date><risdate>2022</risdate><volume>1</volume><issue>5</issue><spage>pgac252</spage><epage>pgac252</epage><pages>pgac252-pgac252</pages><issn>2752-6542</issn><eissn>2752-6542</eissn><abstract>Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10
barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10
barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10
barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>36712375</pmid><doi>10.1093/pnasnexus/pgac252</doi><orcidid>https://orcid.org/0000-0003-0771-0841</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2752-6542 |
ispartof | PNAS nexus, 2022-11, Vol.1 (5), p.pgac252-pgac252 |
issn | 2752-6542 2752-6542 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9802387 |
source | Open Access: Oxford University Press Open Journals; PubMed Central |
subjects | Analysis Bar codes Biological, Health, and Medical Sciences DNA DNA sequencing Error-correcting codes Multiprocessing Nucleotide sequencing Statistics |
title | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T22%3A00%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Fast%20trimer%20statistics%20facilitate%20accurate%20decoding%20of%20large%20random%20DNA%20barcode%20sets%20even%20at%20large%20sequencing%20error%20rates&rft.jtitle=PNAS%20nexus&rft.au=Press,%20William%20H&rft.date=2022-11-01&rft.volume=1&rft.issue=5&rft.spage=pgac252&rft.epage=pgac252&rft.pages=pgac252-pgac252&rft.issn=2752-6542&rft.eissn=2752-6542&rft_id=info:doi/10.1093/pnasnexus/pgac252&rft_dat=%3Cgale_pubme%3EA779046382%3C/gale_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c418t-c6a78decd84dd72a9425d7f54e5233ccd41dae4bcc505afefb956912d9f6ede63%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2771084982&rft_id=info:pmid/36712375&rft_galeid=A779046382&rfr_iscdi=true |