1 Introduction

Batch effects refer to differences between data sets generated at different times or in different laboratories. These often occur due to uncontrolled variability in experimental factors, e.g., reagent quality, operator skill, atmospheric ozone levels. The presence of batch effects can interfere with downstream analyses if they are not explicitly modelled. For example, differential expression analyses typically use a blocking factor to absorb any batch-to-batch differences.

For single-cell RNA sequencing (scRNA-seq) data analyses, explicit modelling of the batch effect is less relevant. Manny common downstream procedures for exploratory data analysis are not model-based, including clustering and visualization. It is more generally useful to have methods that can remove batch effects to create an corrected expression matrix for further analysis. This follows the same strategy as, e.g., the removeBatchEffect() function in the limma package (Ritchie et al. 2015).

Batch correction methods designed for bulk genomics data usually require knowledge of the other factors of variation. This is usually not known in scRNA-seq experiments where the aim is to explore unknown heterogeneity in cell populations. The batchelor package implements batch correction methods that do not rely on a priori knowledge about the population structure.

2 Setting up demonstration data

To demonstrate, we will use two brain data sets (Tasic et al. 2016; Zeisel et al. 2015) from the scRNAseq package.

library(scRNAseq)
sce1 <- ZeiselBrainData()
sce1

## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat

sce2 <- TasicBrainData()[,1:500]
sce2

## class: SingleCellExperiment 
## dim: 24058 500 
## metadata(0):
## assays(1): counts
## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780 tdTomato
## rowData names(0):
## colnames(500): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2 ...
##   Htr3a_tdTpositive_cell_101 Htr3a_tdTpositive_cell_102
## colData names(13): sample_title mouse_line ... secondary_type
##   aibs_vignette_id
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

For the sake of speed, we will use a randomly selected subset of 500 cells from each batch.

set.seed(19999)
sce1 <- sce1[,sample(ncol(sce1), 500, replace=TRUE)]
sce2 <- sce2[,sample(ncol(sce2), 500, replace=TRUE)]

Some preprocessing is required to render these two datasets comparable. We subset to the common subset of genes:

universe <- intersect(rownames(sce1), rownames(sce2))
sce1 <- sce1[universe,]
sce2 <- sce2[universe,]

We compute log-normalized expression values, using a library size-derived size factors for simplicity. (More complex size factor calculation methods are available in the scran package.)

# Ignoring spike-ins with use_altexps=FALSE.
out <- multiBatchNorm(sce1, sce2, norm.args=list(use_altexps=FALSE))
sce1 <- out[[1]]
sce2 <- out[[2]]

Finally, we identify all genes with positive biological components. We will use these for all high-dimensional procedures such as PCA and nearest-neighbor searching.

library(scran)
dec1 <- modelGeneVar(sce1)
dec2 <- modelGeneVar(sce2)
combined.dec <- combineVar(dec1, dec2)
chosen.hvgs <- combined.dec$bio > 0
summary(chosen.hvgs)

##    Mode   FALSE    TRUE 
## logical   10791    7907

We evaluate the batch effect by combining the two SingleCellExperiment objects without any correction and creating a PCA plot:

library(scater)
combined <- correctExperiments(A=sce1, B=sce2, PARAM=NoCorrectParam())
combined <- runPCA(combined, subset_row=chosen.hvgs)
plotPCA(combined, colour_by="batch")

3 Mutual nearest neighbors

3.1 Overview

Mutual nearest neighbors (MNNs) are defined as pairs of cells - one from each batch - that are within each other’s set of k nearest neighbors. The idea is that MNN pairs across batches refer to the same cell type, assuming that the batch effect is orthogonal to the biological subspace (Haghverdi et al. 2018). Once MNN pairs are identified, the difference between the paired cells is used to infer the magnitude and direction of the batch effect. It is then straightforward to remove the batch effect and obtain a set of corrected expression values.

3.2 The new, fast method

The fastMNN() function performs a principal components analysis (PCA) on the HVGs to obtain a low-dimensional representation of the input data. MNN identification and correction is performed in this low-dimensional space, which offers some advantages with respect to speed and denoising.
This returns a SingleCellExperiment containing a matrix of corrected PC scores is returned, which can be used directly for downstream analyses such as clustering and visualization.

library(batchelor)
f.out <- fastMNN(A=sce1, B=sce2, subset.row=chosen.hvgs)
str(reducedDim(f.out, "corrected"))

##  num [1:1000, 1:50] 0.1553 0.0895 -0.3697 0.0876 0.1379 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:1000] "1772071036_B04" "1772066080_B12" "1772063062_B10" "1772071036_E02" ...
##   ..$ : NULL

The batch of origin for each row/cell in the output matrix is also returned:

rle(f.out$batch)

## Run Length Encoding
##   lengths: int [1:2] 500 500
##   values : chr [1:2] "A" "B"

Another way to call fastMNN() is to specify the batch manually. This will return a corrected matrix with the cells in the same order as that in the input combined object. In this case, it doesn’t matter as the two batches were concatenated to created combined anyway, but these semantics may be useful when cells from the same batch are not contiguous.

f.out2 <- fastMNN(combined, batch=combined$batch, subset.row=chosen.hvgs)
str(reducedDim(f.out2, "corrected"))

##  num [1:1000, 1:50] -0.1553 -0.0895 0.3697 -0.0876 -0.1379 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:1000] "1772071036_B04" "1772066080_B12" "1772063062_B10" "1772071036_E02" ...
##   ..$ : NULL

As we can see, cells from the batches are more intermingled in the PCA plot below. This suggests that the batch effect was removed - assuming, of course, that the intermingled cells represent the same population.

plotReducedDim(f.out, colour_by="batch", dimred="corrected")

We can also obtain per-gene corrected expression values by using the rotation vectors stored in the output. This reverses the original projection used to obtain the initial low-dimensional representation. There are, however, many caveats to using these values for downstream analyses.

cor.exp <- assay(f.out)[1,]
hist(cor.exp, xlab="Corrected expression for gene 1", col="grey80")

While the default arguments are usually satisfactory, there are many options for running fastMNN(), e.g., to improve speed or to achieve a particular merge order. Refer to the corresponding workflow for more details.

3.3 The old, classic method

The original method described by Haghverdi et al. (2018) is implemented in the mnnCorrect() method. This performs the MNN identification and correction in the gene expression space, and uses a different strategy to overcome high-dimensional noise. mnnCorrect() is called with the same semantics as fastMNN():

# Using fewer genes as it is much slower. 
fewer.hvgs <- head(order(combined.dec$bio, decreasing=TRUE), 1000)
classic.out <- mnnCorrect(sce1, sce2, subset.row=fewer.hvgs)

… but returns the corrected gene expression matrix directly, rather than using a low-dimensional representation11 Again, those readers wanting to use the corrected values for per-gene analyses should consider the caveats mentioned previously.. This is wrapped in a SummarizedExperiment object to store various batch-related metadata.

classic.out

## class: SingleCellExperiment 
## dim: 1000 1000 
## metadata(1): merge.info
## assays(1): corrected
## rownames(1000): Plp1 Vip ... Atp2c1 Lrrtm2
## rowData names(0):
## colnames(1000): 1772071036_B04 1772066080_B12 ...
##   Gad2_tdTpositive_cell_69 Htr3a_tdTpositive_cell_61
## colData names(1): batch
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

For scRNA-seq data, fastMNN() tends to be both faster and better at achieving a satisfactory merge. mnnCorrect() is mainly provided here for posterity’s sake, though it is more robust than fastMNN() to certain violations of the orthogonality assumptions.

4 Batch rescaling

rescaleBatches() effectively centers the batches in log-expression space on a per-gene basis. This is conceptually equivalent to running removeBatchEffect() with no covariates other than the batch. However, rescaleBatches() achieves this rescaling by reversing the log-transformation, downscaling the counts so that the average of each batch is equal to the smallest value, and then re-transforming. This preserves sparsity by ensuring that zeroes remain so after correction, and mitigates differences in the variance when dealing with counts of varying size between batches22 Done by downscaling, which increases the shrinkage from the added pseudo-count..

Calling rescaleBatches() returns a corrected matrix of per-gene log-expression values, wrapped in a SummarizedExperiment containin batch-related metadata. This function operates on a per-gene basis so there is no need to perform subsetting (other than to improve speed).

rescale.out <- rescaleBatches(sce1, sce2)
rescale.out

## class: SingleCellExperiment 
## dim: 18698 1000 
## metadata(0):
## assays(1): corrected
## rownames(18698): Tspan12 Tshz1 ... Uty Usp9y
## rowData names(0):
## colnames(1000): 1772071036_B04 1772066080_B12 ...
##   Gad2_tdTpositive_cell_69 Htr3a_tdTpositive_cell_61
## colData names(1): batch
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

While this method is fast and simple, it makes the strong assumption that the population composition of each batch is the same. This is usually not the case for scRNA-seq experiments in real systems that exhibit biological variation. Thus, rescaleBatches() is best suited for merging technical replicates of the same sample, e.g., that have been sequenced separately.

rescale.out <- runPCA(rescale.out, subset_row=chosen.hvgs,
    exprs_values="corrected")
plotPCA(rescale.out, colour_by="batch")

Alternatively, a more direct linear regression of the batch effect can be performed with regressBatches(). This does not preserve sparsity but uses a different set of tricks to avoid explicitly creating a dense matrix, specifically by using the ResidualMatrix class from the BiocSingular package.

regress.out <- regressBatches(sce1, sce2)
assay(regress.out)

## <18698 x 1000> matrix of class ResidualMatrix and type "double":
##                     1772071036_B04            1772066080_B12 ...
##  Tspan12                0.27482997               -0.23092943   .
##    Tshz1               -0.32150420               -0.32150420   .
##   Fnbp1l               -0.50544418               -0.50544418   .
## Adamts15               -0.03026338               -0.03026338   .
##   Cldn12                0.26177938               -0.24398003   .
##      ...                         .                         .   .
##     Mid1               -0.05197117               -0.05197117   .
##     Asmt                0.00000000                0.00000000   .
##    Kdm5d               -0.13185490               -0.13185490   .
##      Uty               -0.19049220               -0.19049220   .
##    Usp9y                0.00000000                0.00000000   .
##          Htr3a_tdTpositive_cell_61
##  Tspan12               -0.82382200
##    Tshz1               -0.44042547
##   Fnbp1l               -0.27629089
## Adamts15               -0.02306997
##   Cldn12               -0.44988246
##      ...                         .
##     Mid1               -0.07062336
##     Asmt                0.00000000
##    Kdm5d               -0.33322662
##      Uty                0.11880993
##    Usp9y                0.00000000

5 Using data subsets

5.1 Selecting genes

As shown above, the subset.row= argument will only perform the correction on a subset of genes in the data set. This is useful for focusing on highly variable or marker genes during high-dimensional procedures like PCA or neighbor searches, mitigating noise from irrelevant genes. For per-gene methods, this argument provides a convenient alternative to subsetting the input.

For some functions, it is also possible to set correct.all=TRUE when subset.row= is specified. This will compute corrected values for the unselected genes as well, which is possible once the per-cell statistics are obtained with the gene subset. With this setting, we can guarantee that the output contains all the genes provided in the input.

5.2 Restricted correction

Many functions support the restrict= argument whereby the correction is determined using only a restricted subset of cells in each batch. The effect of the correction is then - for want of a better word - “extrapolated” to all other cells in that batch. This is useful for experimental designs where a control set of cells from the same source population were run on different batches. Any difference in the controls between batches must be artificial in origin, allowing us to estimate and remove the batch effect without making further biological assumptions.

# Pretend the first X cells in each batch are controls.
restrict <- list(1:100, 1:200) 
rescale.out <- rescaleBatches(sce1, sce2, restrict=restrict)

6 Other utilities

6.1 Multi-batch normalization

Differences in sequencing depth between batches are an obvious cause for batch-to-batch differences. These can be removed by multiBatchNorm(), which downscales all batches to match the coverage of the least-sequenced batch. This function returns a list of SingleCellExperiment objects with log-transformed normalized expression values that can be directly used for further correction.

normed <- multiBatchNorm(A=sce1, B=sce2,
    norm.args=list(use_altexps=FALSE))
names(normed)

## [1] "A" "B"

Downscaling mitigates differences in variance between batches due to the mean-variance relationship of count data. It is achieved using a median-based estimator to avoid problems with composition biases between batches (Lun, Bach, and Marioni 2016). Note that this assumes that most genes are not DE between batches, which we try to avoid violating by performing the rescaling on all genes rather than just the HVGs.

6.2 Multi-batch PCA

Users can perform a PCA across multiple batches using the multiBatchPCA() function. The output of this function is roughly equivalent to cbinding all batches together and performing PCA on the merged matrix. The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors. This allows small batches with unique subpopulations to contribute meaningfully to the definition of the low-dimensional space.

# Using the same BSPARAM argument as fastMNN(), for speed.
pca.out <- multiBatchPCA(A=sce1, B=sce2, subset.row=chosen.hvgs,
    BSPARAM=BiocSingular::IrlbaParam(deferred=TRUE))
names(pca.out)

## [1] "A" "B"

This function is used internally in fastMNN() but can be explicitly called by the user. Reduced dimensions can be input into reducedMNN(), which is occasionally convenient as it allows different correction parameters to be repeated without having to repeat the time-consuming PCA step.

7 Session information

sessionInfo()

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.14.6               ggplot2_3.2.1              
##  [3] scran_1.14.5                scRNAseq_2.0.2             
##  [5] batchelor_1.2.4             SingleCellExperiment_1.8.0 
##  [7] SummarizedExperiment_1.16.1 DelayedArray_0.12.1        
##  [9] BiocParallel_1.20.1         matrixStats_0.55.0         
## [11] Biobase_2.46.0              GenomicRanges_1.38.0       
## [13] GenomeInfoDb_1.22.0         IRanges_2.20.1             
## [15] S4Vectors_0.24.1            BiocGenerics_0.32.0        
## [17] knitr_1.26                  BiocStyle_2.14.2           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] httr_1.4.1                    tools_3.6.2                  
##  [5] backports_1.1.5               R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.1.0                     lazyeval_0.2.2               
## [11] colorspace_1.4-1              withr_2.1.2                  
## [13] tidyselect_0.2.5              gridExtra_2.3                
## [15] bit_1.1-14                    curl_4.3                     
## [17] compiler_3.6.2                BiocNeighbors_1.4.1          
## [19] labeling_0.3                  bookdown_0.16                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.23                
## [25] rmarkdown_2.0                 XVector_0.26.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] limma_3.42.0                  fastmap_1.0.1                
## [31] dbplyr_1.4.2                  rlang_0.4.2                  
## [33] RSQLite_2.1.5                 shiny_1.4.0                  
## [35] DelayedMatrixStats_1.8.0      farver_2.0.1                 
## [37] dplyr_0.8.3                   RCurl_1.95-4.12              
## [39] magrittr_1.5                  BiocSingular_1.2.1           
## [41] GenomeInfoDbData_1.2.2        Matrix_1.2-18                
## [43] Rcpp_1.0.3                    ggbeeswarm_0.6.0             
## [45] munsell_0.5.0                 viridis_0.5.1                
## [47] lifecycle_0.1.0               edgeR_3.28.0                 
## [49] stringi_1.4.3                 yaml_2.2.0                   
## [51] zlibbioc_1.32.0               BiocFileCache_1.10.2         
## [53] AnnotationHub_2.18.0          grid_3.6.2                   
## [55] blob_1.2.0                    dqrng_0.2.1                  
## [57] promises_1.1.0                ExperimentHub_1.12.0         
## [59] crayon_1.3.4                  lattice_0.20-38              
## [61] cowplot_1.0.0                 locfit_1.5-9.1               
## [63] zeallot_0.1.0                 pillar_1.4.3                 
## [65] igraph_1.2.4.2                glue_1.3.1                   
## [67] BiocVersion_3.10.1            evaluate_0.14                
## [69] BiocManager_1.30.10           vctrs_0.2.1                  
## [71] httpuv_1.5.2                  gtable_0.3.0                 
## [73] purrr_0.3.3                   assertthat_0.2.1             
## [75] xfun_0.11                     mime_0.8                     
## [77] rsvd_1.0.2                    xtable_1.8-4                 
## [79] later_1.0.0                   viridisLite_0.3.0            
## [81] tibble_2.1.3                  AnnotationDbi_1.48.0         
## [83] beeswarm_0.2.3                memoise_1.1.0                
## [85] statmod_1.4.32                interactiveDisplayBase_1.24.0

References

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.

Tasic, B., V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, et al. 2016. “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics.” Nat. Neurosci. 19 (2):335–46.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.

Correcting batch effects in single-cell RNA-seq data

Revised: 3 February 2019

Package