Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths
High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple ba...
Published in: | Molecular ecology resources 2018-07, Vol.18 (4), p.778-788 |
---|---|
Main Authors: | Leigh, D. M. ; Lischer, H. E. L. ; Grossen, C. ; Keller, L. F. |
Format: | Article |
Language: | English |
Subjects: | |
cited_by | cdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143 |
---|---|
cites | cdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143 |
container_end_page | 788 |
container_issue | 4 |
container_start_page | 778 |
container_title | Molecular ecology resources |
container_volume | 18 |
creator | Leigh, D. M. ; Lischer, H. E. L. ; Grossen, C. ; Keller, L. F. |
description | High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies. |
doi_str_mv | 10.1111/1755-0998.12779 |
format | article |
pmid | 29573184 |
publisher | England: Wiley Subscription Services, Inc |
rights | 2018 John Wiley & Sons Ltd |
orcid | https://orcid.org/0000-0003-3902-2568 |
oa | free_for_read |
tpages | 11 |
fulltext | fulltext |
identifier | ISSN: 1755-098X |
ispartof | Molecular ecology resources, 2018-07, Vol.18 (4), p.778-788 |
issn | 1755-098X ; 1755-0998 |
language | eng |
recordid | cdi_proquest_miscellaneous_2018024190 |
source | Wiley-Blackwell Read & Publish Collection |
subjects | Alleles ; Alpine environments ; Animals ; Biological effects ; Case studies ; Gene Frequency ; genotyping error ; Goats - genetics ; GWAS ; High-Throughput Nucleotide Sequencing ; long‐term data ; Next-generation sequencing ; outlier ; Polymorphism, Single Nucleotide ; Population genetics ; Populations ; RADseq ; Randomization ; Selection, Genetic ; sequencing error ; Statistical analysis |
title | Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T08%3A58%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Batch%20effects%20in%20a%20multiyear%20sequencing%20study:%20False%20biological%20trends%20due%20to%20changes%20in%20read%20lengths&rft.jtitle=Molecular%20ecology%20resources&rft.au=Leigh,%20D.%20M.&rft.date=2018-07&rft.volume=18&rft.issue=4&rft.spage=778&rft.epage=788&rft.pages=778-788&rft.issn=1755-098X&rft.eissn=1755-0998&rft_id=info:doi/10.1111/1755-0998.12779&rft_dat=%3Cproquest_cross%3E2018024190%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2061769032&rft_id=info:pmid/29573184&rfr_iscdi=true |