
k-mer-based approaches to bridging pangenomics and population genetics

Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning referenc...

Bibliographic Details
Published in: ArXiv.org 2024-09
Main Authors: Roberts, Miles D, Davis, Olivia, Josephs, Emily B, Williamson, Robert J
Format: Article
Language: English
container_title ArXiv.org
creator Roberts, Miles D
Davis, Olivia
Josephs, Emily B
Williamson, Robert J
description Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes, and correctly aligning references to build pangenomes is challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that $k$-mers are a crucial stepping stone to bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of $k$-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different $k$-mer-based measures of genetic variation behave in population genetic simulations according to the choice of $k$, depth of sequencing coverage, and degree of data compression. Overall, we find that $k$-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity ($\pi$) up to values of about $\pi = 0.025$ ($R^2 = 0.97$) for neutrally evolving populations. For populations with even more variation, using shorter $k$-mers will maintain the scalability up to at least $\pi = 0.1$. Furthermore, in our simulated populations, $k$-mer dissimilarity values can be reliably approximated from counting Bloom filters, highlighting a potential avenue for decreasing the memory burden of $k$-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using $k$-mers.
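The idea of measuring $k$-mer dissimilarity, exactly or through a counting Bloom filter, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function and class names are hypothetical, Bray-Curtis is only one of several possible dissimilarity measures, and the filter size and number of hash functions are arbitrary. The sketch counts overlapping $k$-mers in two short sequences, computes a dissimilarity from the exact counts, and shows that a counting Bloom filter stores the same counts in a fixed-size array at the cost of possible over-estimates.

```python
import hashlib
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (exact, dictionary-based)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(a: Counter, b: Counter) -> float:
    """Bray-Curtis dissimilarity between two k-mer count profiles.

    0.0 means identical profiles, 1.0 means no shared k-mers.
    (Illustrative choice of measure, not confirmed by the source.)
    """
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


class CountingBloomFilter:
    """Fixed-size approximate counter (hypothetical minimal version).

    Each item increments n_hashes counters; its estimated count is the
    minimum of those counters. Collisions can only inflate an estimate,
    so the estimate is never below the true count.
    """

    def __init__(self, size: int = 1024, n_hashes: int = 3):
        self.size = size
        self.n_hashes = n_hashes
        self.counters = [0] * size

    def _slots(self, item: str):
        # Derive n_hashes slot indices by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: str) -> None:
        for slot in self._slots(item):
            self.counters[slot] += 1

    def count(self, item: str) -> int:
        return min(self.counters[slot] for slot in self._slots(item))


# Usage: compare two toy sequences with k = 3.
seq_a, seq_b, k = "ACGTACGT", "ACGTTCGT", 3
exact_a = kmer_counts(seq_a, k)
exact_b = kmer_counts(seq_b, k)
print(bray_curtis(exact_a, exact_b))  # 0.5

# Store the k-mers of seq_a in a counting Bloom filter instead of a dict.
cbf = CountingBloomFilter(size=1024, n_hashes=3)
for i in range(len(seq_a) - k + 1):
    cbf.add(seq_a[i:i + k])
approx_a = Counter({kmer: cbf.count(kmer) for kmer in exact_a})
```

In this toy setting the filter is far larger than the number of distinct $k$-mers, so the approximate counts will usually equal the exact ones; shrinking `size` trades memory for more collisions and larger over-estimates.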
format article
date 2024-09-18
eissn 2331-8422
pmid 39398200
publisher United States: Cornell University
fulltext fulltext
identifier ISSN: 2331-8422
ispartof ArXiv.org, 2024-09
issn 2331-8422
2331-8422
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11468241
source Publicly Available Content Database
title k-mer-based approaches to bridging pangenomics and population genetics
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T04%3A17%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=k-mer-based%20approaches%20to%20bridging%20pangenomics%20and%20population%20genetics&rft.jtitle=ArXiv.org&rft.au=Roberts,%20Miles%20D&rft.date=2024-09-18&rft.issn=2331-8422&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest_pubme%3E3116335917%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-p1121-dd42a49b65494c97e33c80cafbd00751ef570c55869d58d00dcf662c5d6cf6273%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3116335917&rft_id=info:pmid/39398200&rfr_iscdi=true