mnnCorrect {batchelor} | R Documentation |
Correct for batch effects in single-cell expression data using the mutual nearest neighbors method.
mnnCorrect(..., batch = NULL, restrict = NULL, k = 20, sigma = 0.1, cos.norm.in = TRUE, cos.norm.out = TRUE, svd.dim = 0L, var.adj = TRUE, subset.row = NULL, correct.all = FALSE, auto.order = FALSE, assay.type = "logcounts", get.spikes = FALSE, BSPARAM = ExactParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam())
... |
Two or more log-expression matrices where genes correspond to rows and cells correspond to columns. Each matrix should contain cells from the same batch; multiple matrices represent separate batches of cells. Each matrix should contain the same number of rows, corresponding to the same genes (in the same order). Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the assay.type assay. |
batch |
A factor specifying the batch of origin for all cells when only a single object is supplied via the ... argument. |
restrict |
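A list of length equal to the number of objects supplied via the ... argument, where each element specifies the subset of cells in the corresponding batch to use for computing the correction. |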
k |
An integer scalar specifying the number of nearest neighbors to consider when identifying mutual nearest neighbors. |
sigma |
A numeric scalar specifying the bandwidth of the Gaussian smoothing kernel used to compute the correction vector for each cell. |
cos.norm.in |
A logical scalar indicating whether cosine normalization should be performed on the input data prior to calculating distances between cells. |
cos.norm.out |
A logical scalar indicating whether cosine normalization should be performed prior to computing corrected expression values. |
svd.dim |
An integer scalar specifying the number of dimensions to use for summarizing biological substructure within each batch. |
var.adj |
A logical scalar indicating whether variance adjustment should be performed on the correction vectors. |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
A logical scalar specifying whether correction should be applied to all genes, even if only a subset is used for the MNN calculations. |
auto.order |
A logical scalar indicating whether batches should be re-ordered to maximize the number of MNN pairs at each step. Alternatively, an integer vector containing a permutation of the batch indices, specifying the merge order. |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values, if SingleCellExperiment objects are supplied via the ... argument. |
get.spikes |
A logical scalar indicating whether to retain rows corresponding to spike-in transcripts. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the SVD algorithm to use. |
BNPARAM |
A BiocNeighborParam object specifying the neighbor search algorithm to use. |
BPPARAM |
A BiocParallelParam object specifying the parallelization scheme to use. |
This function is designed for batch correction of single-cell RNA-seq data where the batches are partially confounded with biological conditions of interest. It does so by identifying pairs of mutual nearest neighbors (MNN) in the high-dimensional log-expression space. Each MNN pair represents cells in different batches that are of the same cell type/state, assuming that batch effects are mostly orthogonal to the biological manifold. Correction vectors are calculated from the pairs of MNNs and corrected (log-)expression values are returned for use in clustering and dimensionality reduction.
The threshold for defining nearest neighbors is set by k, which is passed to findMutualNN to identify MNN pairs.
The size of k can be interpreted as the minimum size of a subpopulation in each batch.
Values that are too small will not yield enough MNN pairs, while values that are too large will ignore substructure within each batch.
The algorithm is generally robust to various choices of k.
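The role of k can be illustrated with a brute-force search in base R. This is only a sketch under stated assumptions: find_mnn_naive is a hypothetical helper written for this example, not part of batchelor, and mnnCorrect itself delegates the search to findMutualNN using the algorithm chosen by BNPARAM.

```r
# Toy data: genes in rows, cells in columns (as for mnnCorrect inputs).
set.seed(100)
B1 <- matrix(rnorm(500), nrow = 10)        # 10 genes x 50 cells
B2 <- matrix(rnorm(400), nrow = 10) + 0.5  # 10 genes x 40 cells, shifted

# Hypothetical helper (not part of batchelor): brute-force MNN search.
find_mnn_naive <- function(x1, x2, k = 20) {
    n1 <- ncol(x1)
    n2 <- ncol(x2)
    # Euclidean distances between each cell of x1 (rows) and x2 (columns).
    d <- as.matrix(dist(t(cbind(x1, x2))))[seq_len(n1), n1 + seq_len(n2), drop = FALSE]
    # TRUE where a cell of x2 is among the k nearest neighbors of a cell of x1.
    in12 <- t(apply(d, 1, function(v) rank(v, ties.method = "first") <= k))
    # TRUE where a cell of x1 is among the k nearest neighbors of a cell of x2.
    in21 <- apply(d, 2, function(v) rank(v, ties.method = "first") <= k)
    # A pair is mutual when each cell is in the other's neighbor set.
    idx <- which(in12 & in21, arr.ind = TRUE)
    data.frame(first = unname(idx[, 1]), second = unname(idx[, 2]))
}

pairs <- find_mnn_naive(B1, B2, k = 20)
head(pairs)
```

Because the neighbor sets are nested as k grows, a larger k can only add MNN pairs, which is why very small values may leave too few pairs for a stable correction.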
For each MNN pair, a pairwise correction vector is computed based on the difference in the log-expression profiles.
The correction vector for each cell is computed by applying a Gaussian smoothing kernel with bandwidth sigma to the pairwise vectors.
This stabilizes the vectors across many MNN pairs and extends the correction to those cells that do not have MNNs.
The choice of sigma determines the extent of smoothing; a value of 0.1 is used by default, corresponding to 10% of the radius of the space after cosine normalization.
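The smoothing step can be sketched in base R as a kernel-weighted average of the pairwise vectors. Everything here is an assumption made for illustration: smooth_correction, the toy inputs and the exact form of the kernel weight are hypothetical, and the internal implementation may scale or normalize the kernel differently.

```r
set.seed(1)
genes <- 5
cells2 <- matrix(rnorm(genes * 30), nrow = genes)  # batch being corrected
mnn2 <- c(1, 4, 9)  # hypothetical indices of cells in 'cells2' with MNN partners
# Hypothetical pairwise correction vectors, one column per MNN pair.
pair_vecs <- matrix(rnorm(genes * length(mnn2)), nrow = genes)

# Hypothetical helper: per-cell correction as a Gaussian-weighted average.
smooth_correction <- function(x, mnn, vecs, sigma = 0.1) {
    out <- matrix(0, nrow(x), ncol(x))
    for (i in seq_len(ncol(x))) {
        # Squared distance from cell i to each cell involved in an MNN pair.
        d2 <- colSums((x[, mnn, drop = FALSE] - x[, i])^2)
        # Gaussian-type weights; subtracting min(d2) avoids underflow and
        # does not change the normalized weights.
        w <- exp(-(d2 - min(d2)) / sigma^2)
        out[, i] <- vecs %*% (w / sum(w))
    }
    out
}

smoothed <- smooth_correction(cells2, mnn2, pair_vecs, sigma = 0.1)
dim(smoothed)
```

Every cell receives a correction vector, including cells without MNN partners; a smaller sigma makes each cell follow its closest MNN pair more strictly, while a larger sigma averages over more pairs.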
A SingleCellExperiment object containing the corrected assay.
This contains corrected (log-)expression values for each gene (row) in each cell (column) in each batch.
A batch field is present in the column data, specifying the batch of origin for each cell.
Cells in the output object are always ordered in the same manner as supplied in the ... argument.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
The metadata of the SingleCellExperiment contains:
merge.order: a vector of batch names or indices, specifying the order in which batches were merged.
merge.info: a DataFrame of information about each merge step (corresponding to each row). This contains pairs, a List of DataFrames specifying which pairs of cells in corrected were identified as MNNs at each step.
All genes are used with the default setting of subset.row=NULL.
Users can set subset.row to subset the inputs to highly variable genes or marker genes.
This may provide more meaningful identification of MNN pairs by reducing the noise from irrelevant genes.
Note that users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.
For SingleCellExperiment inputs, spike-in transcripts are automatically removed unless get.spikes=TRUE.
If subset.row is specified and get.spikes=FALSE, only the non-spike-in specified features will be used.
All SingleCellExperiment objects should have the same set of spike-in transcripts.
If subset.row is specified and correct.all=TRUE, corrected values are returned for all genes.
This is possible as subset.row is only used to identify the MNN pairs and for other cell-based distance calculations.
Correction vectors between MNN pairs can then be computed for all genes in the supplied matrices.
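A usage sketch combining subset.row and correct.all is shown below. The choice of the 100 most variable genes is a crude, hypothetical selection made only for illustration; in practice a dedicated feature selection method would normally be used. The call to mnnCorrect is guarded so that the snippet also runs where batchelor is not installed.

```r
set.seed(4)
B1 <- matrix(rnorm(10000), ncol = 50)  # Batch 1: 200 genes x 50 cells
B2 <- matrix(rnorm(10000), ncol = 50)  # Batch 2: 200 genes x 50 cells

# Illustrative assumption: rank genes by their variance across all cells
# and keep the top 100 for MNN identification.
gene_var <- apply(cbind(B1, B2), 1, var)
chosen <- order(gene_var, decreasing = TRUE)[1:100]

if (requireNamespace("batchelor", quietly = TRUE)) {
    # MNN pairs are found using 'chosen' only; with correct.all=TRUE,
    # corrected values are returned for every gene.
    out <- batchelor::mnnCorrect(B1, B2, subset.row = chosen, correct.all = TRUE)
    dim(out)
}
```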
The input expression values should generally be log-transformed, e.g., log-counts, see normalize for details.
They should also be normalized within each data set to remove cell-specific biases in capture efficiency and sequencing depth.
By default, a further cosine normalization step is performed on the supplied expression data to eliminate gross scaling differences between data sets.
When cos.norm.in=TRUE, cosine normalization is performed on the matrix of expression values used to compute distances between cells.
This can be turned off when there are no scaling differences between data sets.
When cos.norm.out=TRUE, cosine normalization is performed on the matrix of values used to calculate correction vectors (and on which those vectors are applied).
This can be turned off to obtain corrected values on the log-scale, similar to the input data.
The cosine normalization is achieved using the cosineNorm function.
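As a rough illustration of what cosine normalization does, the base R sketch below scales each cell (column) to unit L2 norm. cosine_norm_sketch is a hypothetical stand-in written for this example and is not the cosineNorm function itself; in particular, its handling of all-zero cells is an assumption.

```r
set.seed(5)
x <- matrix(rexp(200), nrow = 20)  # toy log-expression values: 20 genes x 10 cells

# Hypothetical helper: divide each cell's expression vector by its L2 norm.
cosine_norm_sketch <- function(x) {
    l2 <- sqrt(colSums(x^2))
    l2[l2 == 0] <- 1  # assumption: leave all-zero cells unchanged
    sweep(x, 2, l2, "/")
}

y <- cosine_norm_sketch(x)
range(sqrt(colSums(y^2)))  # every cell now has the same length
```

Multiplying a cell's profile by a constant leaves its normalized profile unchanged, which is how this step removes gross scaling differences between data sets.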
The order in which batches are corrected will affect the final results.
The first batch in auto.order is used as the reference batch against which the second batch is corrected.
Corrected values of the second batch are added to the reference batch, against which the third batch is corrected, and so on.
This strategy maximizes the chance of detecting sufficient MNN pairs for stable calculation of correction vectors in subsequent batches.
If auto.order=TRUE, batches are ordered to maximize the number of MNN pairs at each step.
The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs.
This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings.
It is often more convenient for the user to specify an appropriate ordering based on prior knowledge about the batches.
Note that, no matter what the setting of auto.order is, the order of cells in the output corrected matrix is always the same.
The function depends on a shared biological manifold, i.e., one or more cell types/states being present in multiple batches.
If this is not true, MNNs may be incorrectly identified, resulting in over-correction and removal of interesting biology.
Some protection can be provided by removing components of the correction vectors that are parallel to the biological subspaces in each batch.
The biological subspace in each batch is identified with an SVD on the expression matrix to obtain svd.dim dimensions.
(By default, this option is turned off by setting svd.dim=0.)
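The idea of removing the component of a correction vector that lies in a batch's biological subspace can be sketched as an orthogonal projection. This is a simplified illustration, not the internal code: remove_parallel is a hypothetical helper, the inputs are simulated, and centering the genes before the SVD is an assumption of this sketch.

```r
set.seed(2)
genes <- 20
batch <- matrix(rnorm(genes * 40), nrow = genes)  # toy batch: 20 genes x 40 cells
correction <- rnorm(genes)  # hypothetical correction vector for one cell

# Hypothetical helper: project out the top 'ndim' left singular vectors.
remove_parallel <- function(vec, x, ndim = 2) {
    centered <- x - rowMeans(x)                  # center each gene (assumption)
    basis <- svd(centered, nu = ndim, nv = 0)$u  # genes x ndim, orthonormal columns
    # Subtract the part of 'vec' lying in the span of 'basis'.
    drop(vec - basis %*% crossprod(basis, vec))
}

adjusted <- remove_parallel(correction, batch, ndim = 2)
```

The adjusted vector is orthogonal to the estimated subspace, so applying it no longer shifts cells along the main axes of variation within the batch.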
If var.adj=TRUE, the function will adjust the correction vector to equalize the variances of the two data sets along the batch effect vector.
In particular, it avoids “kissing” effects whereby MNN pairs are identified between the surfaces of point clouds from different batches.
Naive correction would then bring only the surfaces into contact, rather than fully merging the clouds together.
The adjustment ensures that the cells from the two batches are properly intermingled after correction.
This is done by identifying each cell's position on the correction vector, identifying corresponding quantiles between batches, and scaling the correction vector to ensure that the quantiles are matched after correction.
It is possible to compute the correction using only a subset of cells in each batch, and then extrapolate that correction to all other cells. This may be desirable in experimental designs where a control set of cells from the same source population was run on different batches. Any difference in the controls must be artificial in origin and can be directly removed without making further biological assumptions.
To do this, users should set restrict to specify the subset of cells in each batch to be used for correction.
This should be set to a list of length equal to the number of objects supplied via the ... argument, where each element is a subsetting vector to be applied to the columns of the corresponding batch.
A NULL element indicates that all the cells from a batch should be used.
In situations where one input object contains multiple batches, restrict is simply a list containing a single subsetting vector for that object.
mnnCorrect will only use the restricted subset of cells in each batch to identify MNN pairs (and to perform variance adjustment, if var.adj=TRUE).
However, it will apply the correction to all cells in each batch, hence the extrapolation.
This means that the output is always of the same dimensionality, regardless of whether restrict is specified.
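A usage sketch for restrict follows. The assumption that the first 30 cells of each batch are controls is purely hypothetical and made for illustration. The call to mnnCorrect is guarded so that the snippet also runs where batchelor is not installed.

```r
set.seed(3)
B1 <- matrix(rnorm(10000), ncol = 50)  # Batch 1: 200 genes x 50 cells
B2 <- matrix(rnorm(10000), ncol = 50)  # Batch 2: 200 genes x 50 cells

# Illustrative assumption: the first 30 cells of each batch are controls.
# One list element per input object, each subsetting that object's columns.
restrict <- list(1:30, 1:30)

if (requireNamespace("batchelor", quietly = TRUE)) {
    # MNN pairs are identified among the control cells only,
    # but the correction is applied to all cells.
    out <- batchelor::mnnCorrect(B1, B2, restrict = restrict)
    dim(out)
}
```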
Laleh Haghverdi, with modifications by Aaron Lun
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421
fastMNN for a faster equivalent.
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
out <- mnnCorrect(B1, B2) # corrected values