Abstract 1659: Overcorrection of batch effects by ComBat can be avoided by using an equal medians method

Bibliographic Details
Published in: Cancer research (Chicago, Ill.), 2019-07, Vol.79 (13_Supplement), p.1659-1659
Main Authors: Obenauer, John C., Stockfisch, Thomas P., Fournier, Marcia V.
Format: Article
Language: English
Description
Summary: Combining multiple data sets from the Gene Expression Omnibus (GEO) or other data repositories for an integrated analysis requires appropriate batch correction. ComBat, an empirical Bayesian method for batch correction of microarray data, is widely used and has been reported to be the best correction method. We combined cancer data from 16 public studies representing 8 tissue types and a total of 3,563 samples, used ComBat from the R “sva” package for batch correction, and examined 6 gene sets representing positive and negative controls. As positive controls, we extracted 4 gene sets from the Human Protein Atlas that were found to be expressed at least 5-fold higher in one tissue than in any of 35 other tissues, and we matched these genes to their Affymetrix U133A probesets. This resulted in 16 probesets specific for stomach, 18 for lung, 37 for pancreas, and 27 for prostate. A fifth positive control is a group of 85 genes called BA80 that we have found to be expressed much lower in blood than in solid tissues. As a negative control that we do not expect to change much between tissues, we used a list of 3,804 housekeeping (HK) genes that were reported to show less than a 4-fold expression change across 16 tissue types. We compared the ComBat results to a new method we call equal medians. The equal medians method assumes that the 22,277 genes measured on the Affymetrix U133A microarrays can vary widely between tissues and batches, but that the median of the 22,277 genes is the same for every sample. We created boxplots of each gene set across the 16 studies before and after each method of batch correction. The reduction in batch effects was scored using the change in standard deviation of the HK genes. The preservation of biological variability was scored using the fold change of the positive controls, comparing the target tissue’s median to the nearest alternate tissue’s median.
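The equal medians assumption described above can be illustrated with a minimal sketch. The abstract does not say how the medians are equalized, so the additive shift on log-scale values, the choice of the shared target (the median of the per-sample medians), and the names `equal_medians`, `samples`, and `target` are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import median


def equal_medians(samples, target=None):
    """Shift every sample so that its median equals one shared target.

    samples: dict mapping a sample name to a list of log-scale expression
             values (hypothetical layout; one value per gene).
    target:  the common median to impose. If omitted, the median of the
             per-sample medians is used (an assumption, not from the abstract).
    """
    medians = {name: median(values) for name, values in samples.items()}
    if target is None:
        target = median(medians.values())
    # An additive shift keeps within-sample differences between genes intact.
    return {
        name: [v - medians[name] + target for v in values]
        for name, values in samples.items()
    }


# Toy data: three samples whose medians differ (7.0, 9.0, 8.5).
samples = {
    "batch1_s1": [5.0, 6.0, 7.0, 9.0, 8.0],
    "batch2_s1": [7.0, 8.0, 9.0, 11.0, 10.0],
    "batch2_s2": [6.5, 7.5, 8.5, 12.5, 9.5],
}
corrected = equal_medians(samples)
for name, values in corrected.items():
    print(name, median(values))
```

Because each sample is only shifted by a constant, differences between genes within a sample are unchanged, which is consistent with the method's stated aim of letting individual genes vary widely between tissues.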
We used two GEO studies as independent representatives of each tissue type, so the two fold changes were averaged to create a single measure. The results using the HK genes showed that ComBat removed 99.90% of the batch effects visible in the raw data, while equal medians removed 61.58%. However, equal medians was better at preserving biological variability, with fold changes of 4.8 for stomach, 13.1 for lung, 42.3 for pancreas, 12.0 for prostate, and 3.9 for blood. The corresponding fold changes for ComBat were 1.4, 1.1, 2.2, 1.0, and 1.0. We conclude that ComBat was best at
ISSN: 0008-5472, 1538-7445
DOI: 10.1158/1538-7445.AM2019-1659