1 Introduction

Batch effects refer to differences between data sets generated at different times or in different laboratories. These often occur due to uncontrolled variability in experimental factors, e.g., reagent quality, operator skill, atmospheric ozone levels. The presence of batch effects can interfere with downstream analyses if they are not explicitly modelled. For example, differential expression analyses typically use a blocking factor to absorb any batch-to-batch differences.

For single-cell RNA sequencing (scRNA-seq) data analyses, explicit modelling of the batch effect is less relevant. Manny common downstream procedures for exploratory data analysis are not model-based, including clustering and visualization. It is more generally useful to have methods that can remove batch effects to create an corrected expression matrix for further analysis. This follows the same strategy as, e.g., the removeBatchEffect() function in the limma package (Ritchie et al. 2015).

Batch correction methods designed for bulk genomics data usually require knowledge of the other factors of variation. This is usually not known in scRNA-seq experiments where the aim is to explore unknown heterogeneity in cell populations. The batchelor package implements batch correction methods that do not rely on a priori knowledge about the population structure.

2 Setting up demonstration data

To demonstrate, we will use two brain data sets (Tasic et al. 2016; Zeisel et al. 2015) from the scRNAseq package.

library(scRNAseq)
sce1 <- ZeiselBrainData()
sce1
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat
sce2 <- TasicBrainData()[,1:500]
sce2
## class: SingleCellExperiment 
## dim: 24058 500 
## metadata(0):
## assays(1): counts
## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780 tdTomato
## rowData names(0):
## colnames(500): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2 ...
##   Htr3a_tdTpositive_cell_101 Htr3a_tdTpositive_cell_102
## colData names(13): sample_title mouse_line ... secondary_type
##   aibs_vignette_id
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

For the sake of speed, we will use a randomly selected subset of 500 cells from each batch.

set.seed(19999)
sce1 <- sce1[,sample(ncol(sce1), 500, replace=TRUE)]
sce2 <- sce2[,sample(ncol(sce2), 500, replace=TRUE)]

Some preprocessing is required to render these two datasets comparable. We subset to the common subset of genes:

universe <- intersect(rownames(sce1), rownames(sce2))
sce1 <- sce1[universe,]
sce2 <- sce2[universe,]

We compute log-normalized expression values, using a library size-derived size factors for simplicity. (More complex size factor calculation methods are available in the scran package.)

# Ignoring spike-ins with use_altexps=FALSE.
out <- multiBatchNorm(sce1, sce2, norm.args=list(use_altexps=FALSE))
sce1 <- out[[1]]
sce2 <- out[[2]]

Finally, we identify all genes with positive biological components. We will use these for all high-dimensional procedures such as PCA and nearest-neighbor searching.

library(scran)
dec1 <- modelGeneVar(sce1)
dec2 <- modelGeneVar(sce2)
combined.dec <- combineVar(dec1, dec2)
chosen.hvgs <- combined.dec$bio > 0
summary(chosen.hvgs)
##    Mode   FALSE    TRUE 
## logical   10791    7907

We evaluate the batch effect by combining the two SingleCellExperiment objects without any correction and creating a PCA plot:

library(scater)
combined <- correctExperiments(A=sce1, B=sce2, PARAM=NoCorrectParam())
combined <- runPCA(combined, subset_row=chosen.hvgs)
plotPCA(combined, colour_by="batch")