Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths

High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple ba...

Bibliographic Details
Published in: Molecular ecology resources 2018-07, Vol.18 (4), p.778-788
Main Authors: Leigh, D. M., Lischer, H. E. L., Grossen, C., Keller, L. F.
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143
cites cdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143
container_end_page 788
container_issue 4
container_start_page 778
container_title Molecular ecology resources
container_volume 18
creator Leigh, D. M.
Lischer, H. E. L.
Grossen, C.
Keller, L. F.
description High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies.
doi_str_mv 10.1111/1755-0998.12779
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2018024190</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2018024190</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143</originalsourceid><addsrcrecordid>eNqFkbtPwzAQhy0EgvKY2ZAlFpYW20n8YAPESyqwgMRm-XFpU6VJsROh_ve4LXRg4RafrO8-nX6H0CklI5rqkoqiGBKl5IgyIdQOGmx_dre9_DhAhzHOCOFEiXwfHTBViIzKfIDgxnRuiqEswXURVw02eN7XXbUEE3CEzx4aVzUTHLveL6_wvakjYFu1dTupnKlxF6DxEfsecNdiNzXNBNaeAMbjGppJN43HaK9cDZ78vEfo_f7u7fZxOH59eLq9Hg9dTpkaOiMZyZSyNi8kcLBlxmVWAqWMGMol8YYWUoBwvswFzyzjxFrjLSfOC5pnR-hi412ENm0eOz2vooO6Ng20fdSMUElYThVJ6PkfdNb2oUnbJYpTwRPDEnW5oVxoYwxQ6kWo5iYsNSV6dQG9yliv8tbrC6SJsx9vb-fgt_xv5AkoNsBXVcPyP59-vnvZiL8BXkOPrw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2061769032</pqid></control><display><type>article</type><title>Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths</title><source>Wiley-Blackwell Read &amp; Publish Collection</source><creator>Leigh, D. M. ; Lischer, H. E. L. ; Grossen, C. ; Keller, L. F.</creator><creatorcontrib>Leigh, D. M. ; Lischer, H. E. L. ; Grossen, C. ; Keller, L. F.</creatorcontrib><description>High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. 
However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies.</description><identifier>ISSN: 1755-098X</identifier><identifier>EISSN: 1755-0998</identifier><identifier>DOI: 10.1111/1755-0998.12779</identifier><identifier>PMID: 29573184</identifier><language>eng</language><publisher>England: Wiley Subscription Services, Inc</publisher><subject>Alleles ; Alpine environments ; Animals ; Biological effects ; Case studies ; Gene Frequency ; genotyping error ; Goats - genetics ; GWAS ; High-Throughput Nucleotide Sequencing ; long‐term data ; Next-generation sequencing ; outlier ; Polymorphism, Single Nucleotide ; Population genetics ; Populations ; RADseq ; Randomization ; Selection, Genetic ; sequencing error ; Statistical analysis</subject><ispartof>Molecular ecology resources, 2018-07, Vol.18 (4), p.778-788</ispartof><rights>2018 John Wiley &amp; Sons Ltd</rights><rights>2018 John Wiley &amp; Sons Ltd.</rights><rights>Copyright 
© 2018 John Wiley &amp; Sons Ltd</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143</citedby><cites>FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143</cites><orcidid>0000-0003-3902-2568</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/29573184$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Leigh, D. M.</creatorcontrib><creatorcontrib>Lischer, H. E. L.</creatorcontrib><creatorcontrib>Grossen, C.</creatorcontrib><creatorcontrib>Keller, L. F.</creatorcontrib><title>Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths</title><title>Molecular ecology resources</title><addtitle>Mol Ecol Resour</addtitle><description>High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). 
The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies.</description><subject>Alleles</subject><subject>Alpine environments</subject><subject>Animals</subject><subject>Biological effects</subject><subject>Case studies</subject><subject>Gene Frequency</subject><subject>genotyping error</subject><subject>Goats - genetics</subject><subject>GWAS</subject><subject>High-Throughput Nucleotide Sequencing</subject><subject>long‐term data</subject><subject>Next-generation sequencing</subject><subject>outlier</subject><subject>Polymorphism, Single Nucleotide</subject><subject>Population genetics</subject><subject>Populations</subject><subject>RADseq</subject><subject>Randomization</subject><subject>Selection, Genetic</subject><subject>sequencing error</subject><subject>Statistical 
analysis</subject><issn>1755-098X</issn><issn>1755-0998</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNqFkbtPwzAQhy0EgvKY2ZAlFpYW20n8YAPESyqwgMRm-XFpU6VJsROh_ve4LXRg4RafrO8-nX6H0CklI5rqkoqiGBKl5IgyIdQOGmx_dre9_DhAhzHOCOFEiXwfHTBViIzKfIDgxnRuiqEswXURVw02eN7XXbUEE3CEzx4aVzUTHLveL6_wvakjYFu1dTupnKlxF6DxEfsecNdiNzXNBNaeAMbjGppJN43HaK9cDZ78vEfo_f7u7fZxOH59eLq9Hg9dTpkaOiMZyZSyNi8kcLBlxmVWAqWMGMol8YYWUoBwvswFzyzjxFrjLSfOC5pnR-hi412ENm0eOz2vooO6Ng20fdSMUElYThVJ6PkfdNb2oUnbJYpTwRPDEnW5oVxoYwxQ6kWo5iYsNSV6dQG9yliv8tbrC6SJsx9vb-fgt_xv5AkoNsBXVcPyP59-vnvZiL8BXkOPrw</recordid><startdate>201807</startdate><enddate>201807</enddate><creator>Leigh, D. M.</creator><creator>Lischer, H. E. L.</creator><creator>Grossen, C.</creator><creator>Keller, L. F.</creator><general>Wiley Subscription Services, Inc</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SN</scope><scope>7SS</scope><scope>8FD</scope><scope>C1K</scope><scope>FR3</scope><scope>M7N</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0003-3902-2568</orcidid></search><sort><creationdate>201807</creationdate><title>Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths</title><author>Leigh, D. M. ; Lischer, H. E. L. ; Grossen, C. ; Keller, L. 
F.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Alleles</topic><topic>Alpine environments</topic><topic>Animals</topic><topic>Biological effects</topic><topic>Case studies</topic><topic>Gene Frequency</topic><topic>genotyping error</topic><topic>Goats - genetics</topic><topic>GWAS</topic><topic>High-Throughput Nucleotide Sequencing</topic><topic>long‐term data</topic><topic>Next-generation sequencing</topic><topic>outlier</topic><topic>Polymorphism, Single Nucleotide</topic><topic>Population genetics</topic><topic>Populations</topic><topic>RADseq</topic><topic>Randomization</topic><topic>Selection, Genetic</topic><topic>sequencing error</topic><topic>Statistical analysis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Leigh, D. M.</creatorcontrib><creatorcontrib>Lischer, H. E. L.</creatorcontrib><creatorcontrib>Grossen, C.</creatorcontrib><creatorcontrib>Keller, L. 
F.</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Engineering Research Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Molecular ecology resources</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Leigh, D. M.</au><au>Lischer, H. E. L.</au><au>Grossen, C.</au><au>Keller, L. F.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths</atitle><jtitle>Molecular ecology resources</jtitle><addtitle>Mol Ecol Resour</addtitle><date>2018-07</date><risdate>2018</risdate><volume>18</volume><issue>4</issue><spage>778</spage><epage>788</epage><pages>778-788</pages><issn>1755-098X</issn><eissn>1755-0998</eissn><abstract>High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. 
Batch effects can be minimized through randomization of sample groups across batches. However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies.</abstract><cop>England</cop><pub>Wiley Subscription Services, Inc</pub><pmid>29573184</pmid><doi>10.1111/1755-0998.12779</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0003-3902-2568</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1755-098X
ispartof Molecular ecology resources, 2018-07, Vol.18 (4), p.778-788
issn 1755-098X
1755-0998
language eng
recordid cdi_proquest_miscellaneous_2018024190
source Wiley-Blackwell Read & Publish Collection
subjects Alleles
Alpine environments
Animals
Biological effects
Case studies
Gene Frequency
genotyping error
Goats - genetics
GWAS
High-Throughput Nucleotide Sequencing
long‐term data
Next-generation sequencing
outlier
Polymorphism, Single Nucleotide
Population genetics
Populations
RADseq
Randomization
Selection, Genetic
sequencing error
Statistical analysis
title Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T08%3A58%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Batch%20effects%20in%20a%20multiyear%20sequencing%20study:%20False%20biological%20trends%20due%20to%20changes%20in%20read%20lengths&rft.jtitle=Molecular%20ecology%20resources&rft.au=Leigh,%20D.%20M.&rft.date=2018-07&rft.volume=18&rft.issue=4&rft.spage=778&rft.epage=788&rft.pages=778-788&rft.issn=1755-098X&rft.eissn=1755-0998&rft_id=info:doi/10.1111/1755-0998.12779&rft_dat=%3Cproquest_cross%3E2018024190%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4129-ca820399bb458e6ebf3683fe1120a1680da1587e7cdf4763b260bbadb60cd7143%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2061769032&rft_id=info:pmid/29573184&rfr_iscdi=true