fastMNN {batchelor} | R Documentation |
Correct for batch effects in single-cell expression data using a fast version of the mutual nearest neighbors (MNN) method.
fastMNN(..., batch = NULL, k = 20, restrict = NULL, cos.norm = TRUE, ndist = 3, d = 50, auto.order = FALSE, min.batch.skip = 0, subset.row = NULL, correct.all = FALSE, pc.input = FALSE, assay.type = "logcounts", get.spikes = FALSE, use.dimred = NULL, BSPARAM = ExactParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam())
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns, if Alternatively, one or more matrices of low-dimensional representations can be supplied if Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the Alternatively, the SingleCellExperiment objects can contain reduced dimension coordinates in the Alternatively, one or more DataFrame objects produced by previous calls to In all cases, each object contains cells from a single batch; multiple objects represent separate batches of cells. Objects of different types can be mixed together if all or none are low-dimensional. |
batch |
A factor specifying the batch of origin for all cells when only a single object is supplied in |
k |
An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs. |
restrict |
A list of length equal to the number of objects in |
cos.norm |
A logical scalar indicating whether cosine normalization should be performed on the input data prior to PCA. |
ndist |
A numeric scalar specifying the threshold beyond which neighbours are to be ignored when computing correction vectors. Each threshold is defined as a multiple of the number of median distances. |
d |
Number of dimensions to use for dimensionality reduction in |
auto.order |
Logical scalar indicating whether re-ordering of batches should be performed to maximize the number of MNN pairs at each step. Alternatively, an integer vector containing a permutation of |
min.batch.skip |
Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step. |
subset.row |
A vector specifying which features to use for correction.
Only relevant for gene expression inputs (i.e., |
correct.all |
Logical scalar indicating whether a rotation matrix should be computed for genes not in |
pc.input |
Logical scalar indicating whether the values in |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values.
Only used for SingleCellExperiment inputs with |
get.spikes |
A logical scalar indicating whether to retain rows corresponding to spike-in transcripts.
Only used for SingleCellExperiment inputs with |
use.dimred |
A string or integer scalar specifying which reduced dimension result to use, if any. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA. |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BPPARAM |
A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized. |
This function provides a variant of the mnnCorrect
function, modified for speed and more robust performance.
In particular:
It performs a multi-sample PCA via multiBatchPCA
and subsequently performs all calculations in the PC space.
This reduces computational work and provides some denoising for improved neighbour detection.
As a result, though, the corrected output cannot be interpreted on a gene level and is useful only for cell-level comparisons, e.g., clustering and visualization.
The correction vector for each cell is directly computed from its k
nearest neighbours in the same batch.
Specifically, only the k
nearest neighbouring cells that also participate in MNN pairs are used.
Each MNN-participating neighbour is weighted by distance from the current cell, using a tricube scheme with bandwidth equal to the median distance multiplied by ndist
.
This ensures that the correction vector only uses information from the closest cells, improving the fidelity of local correction.
Issues with “kissing” are avoided with a two-step procedure that removes variation along the batch effect vector. First, the average correction vector across all MNN pairs is computed. Cell coordinates are adjusted such that all cells in a single batch have the same position along this vector. The correction vectors are then recalculated with the adjusted coordinates (but the same MNN pairs).
The default setting of cos.norm=TRUE
provides some protection against differences in scaling between log-expression matrices from batches that are normalized separately
(see cosineNorm
for details).
However, if possible, we recommend using the output of multiBatchNorm
as input to fastMNN
.
This will equalize coverage on the count level before the log-transformation, which is a more accurate rescaling than cosine normalization on the log-values.
The batch
argument allows users to easily perform batch correction when all cells have already been combined into a single object.
This avoids the need to manually split the matrix or SingleCellExperiment object into separate objects for input into fastMNN
.
In this situation, the order of input batches is defined by the order of levels in batch
.
The output of this function depends on whether a PCA is performed on the input ...
.
This will be the case if pc.input=FALSE
for matrix inputs or if use.dimred=NULL
for SingleCellExperiment inputs.
If a PCA is performed, a SingleCellExperiment is returned where each row is a gene and each column is a cell. This contains:
A corrected
matrix in the reducedDims
slot, containing corrected low-dimensional coordinates for each cell.
This has number of columns equal to d
and number of rows equal to the total number of cells in ...
.
A batch
column in the colData
slot, containing the batch of origin for each row (i.e., cell) in corrected
.
A rotation
column the rowData
slot, containing the rotation matrix used for the PCA.
This has d
columns and number of rows equal to the number of genes to report (see the “Choice of genes” section).
A reconstructed
matrix in the assays
slot, containing the low-rank reconstruction of the original expression matrix.
This can be interpreted as per-gene corrected log-expression values (after cosine normalization, if cos.norm=TRUE
) but should not be used for quantitative analyses.
Otherwise, a DataFrame is returned where each row corresponds to a cell. It contains:
corrected
, the matrix of corrected low-dimensional coordinates for each cell.
batch
, the Rle specifying the batch of origin for each row.
Cells in the output object are always ordered in the same manner as supplied in ...
.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
This is true regardless of the value of auto.order
, which only affects the internal merge order of the batches.
The metadata of the output object contains:
merge.order
, a vector of batch names or indices, specifying the order in which batches were merged.
merge.info
, a DataFrame of information about each merge step (corresponding to each row).
This contains the following fields:
pairs
, a List of DataFrames specifying which pairs of cells in corrected
were identified as MNNs at each step.
batch.vector
, a List of numeric vectors specifying the average batch vector at each step.
batch.size
, a numeric vector specifying the relative magnitude of the batch effect at each merge.
skipped
, a logical vector indicating whether the correction was skipped if the magnitude was below min.batch.skip
.
lost.var
, a numeric matrix specifying the percentage of variance lost due to orthogonalization at each merge step.
This is reported separately for each batch (columns, ordered according to the input order, not the merge order).
pre.orthog
, a DataFrame containing information about pre-correction orthogonalization.
This is only reported if ...
contains one or more DataFrames.
Each row corresponds to a vector used for orthogonalization in one of the DataFrames in ...
.
The vector is stored in batch.vector
and the variance lost due to orthogonalization in each batch is reported in lost.var
.
By default, batches are merged in the user-supplied order.
However, if auto.order=TRUE
, batches are ordered to maximize the number of MNN pairs at each step.
The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs.
This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings.
It is often more convenient for the user to specify an appropriate ordering based on prior knowledge about the batches.
If auto.order
is an integer vector, it is treated as an ordering permutation with which to merge batches.
For example, if auto.order=c(4,1,3,2)
, batches 4 and 1 in ...
are merged first, followed by batch 3 and then batch 2.
This is often more convenient than changing the order manually in ...
, which would alter the order of batches in the output corrected
matrix.
Indeed, no matter what the setting of auto.order
is, the order of cells in the output corrected matrix is always the same.
Further control of the merge order can be achieved by performing the multi-sample PCA outside of this function with multiBatchPCA
.
Batches can then be progressively merged by repeated calls to fastMNN
with low-dimensional inputs (see below).
This is useful in situations where the batches need to be merged in a hierarhical manner, e.g., combining replicate samples before merging them across different conditions.
For example, we could merge batch 1 with 4 to obtain a corrected 1+4; and then batch 2 with 3 to obtain a corrected 2+3;
before merging the corrected 1+4 and 2+3 to obtain the final set of corrected values.
All genes are used with the default setting of subset.row=NULL
.
Users can set subset.row
to subset the inputs to highly variable genes or marker genes.
This improves the quality of the PCA and identification of MNN pairs by reducing the noise from irrelevant genes.
Note that users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.
For SingleCellExperiment inputs, spike-in transcripts are automatically removed unless get.spikes=TRUE
.
If subset.row
is specified and get.spikes=FALSE
, only the non-spike-in specified features will be used.
All SingleCellExperiment objects should have the same set of spike-in transcripts.
By default, only the selected genes are used to compute rotation vectors and a low-rank representation of the input matrix.
However, rotation vectors can be obtained that span all genes in the supplied input data with correct.all=TRUE
.
This will not affect the corrected low-dimension coordinates or the output for the selected genes.
Note that these settings for the choice of genes are completely ignored when using low-dimensional inputs (see below).
Low-dimensional inputs can be supplied directly to fastMNN
if the PCA (or some other projection to low-dimensional space) is performed outside the function.
This intructs the function to skip the multiBatchPCA
step.
To enable this, set pc.input=TRUE
for matrix-like inputs in ...
, or specify use.dimred
with SingleCellExperiment inputs.
If ...
contains any DataFrame objects, these are assumed to be the output of a previous fastMNN
call.
All inputs are subsequently treated as low-dimensional inputs and any other setting of pc.input
is ignored.
If any SingleCellExperiment objects are also present in ...
, use.dimred
must be specified.
Note that multiBatchPCA
will not perform cosine-normalization,
so it is the responsibility of the user to cosine-normalize each batch beforehand with cosineNorm
to recapitulate results with cos.norm=TRUE
.
In addition, multiBatchPCA
must be run on all samples at once, to ensure that all cells are projected to the same low-dimensional space.
Users are referred to the Examples for a demonstration of this functionality.
It is possible to compute the correction using only a subset of cells in each batch, and then extrapolate that correction to all other cells. This may be desirable in experimental designs where a control set of cells from the same source population were run on different batches. Any difference in the controls must be artificial in origin and can be directly removed without making further biological assumptions.
To do this, users should set restrict
to specify the subset of cells in each batch to be used for correction.
This should be set to a list of length equal to the length of ...
, where each element is a subsetting vector to be applied to the columns of the corresponding batch.
A NULL
element indicates that all the cells from a batch should be used.
In situations where one input object contains multiple batches, restrict
is simply a list containing a single subsetting vector for that object.
fastMNN
will only use the restricted subset of cells in each batch to identify MNN pairs and the center of the orthogonalization.
However, it will apply the correction to all cells in each batch - hence the extrapolation.
This means that the output is always of the same dimensionality, regardless of whether restrict
is specified.
Note that all cells are used to perform the PCA, regardless of whether restrict
is set.
Constructing the projection vectors with only control cells will not guarantee resolution of unique non-control populations in each batch.
The function will only completely ignore cells that are not in restrict
if pc.input=TRUE
or, for SingleCellExperiment inputs, use.dimred
is set.
fastMNN
will compute the percentage of variance that is lost from each batch during orthogonalization at each merge step.
This represents the variance in each batch that is parallel to the average correction vectors (and hence removed during orthogonalization) at each merge step.
Large proportions suggest that there is biological structure that is parallel to the batch effect,
corresponding to violations of the assumption that the batch effect is orthogonal to the biological subspace.
If fastMNN
is called with DataFrame inputs, each DataFrame is assumed to be the result of a previous fastMNN
call
and have a set of vectors used for orthogonalization in the merge steps of that previous call.
In the current call, fastMNN
will gather all such batch vectors across all DataFrame inputs.
Each batch is then re-orthogonalized with respect to each of these vectors.
This ensures that the same variation is removed from each batch prior to merging.
The variance lost due to this pre-correction orthogonalization is reported in the pre.orthog
field in the output metadata.
Orthogonalization may cause problems if there is actually no batch effect, resulting in large losses of variance.
To avoid this, fastMNN
will not perform any correction if the relative magnitude of the batch effect is less than min.batch.skip
.
The relative magnitude is defined as the L2 norm of the average correction vector divided by the root-mean-square of the L2 norms of the per-MNN pair correction vectors.
This will be large when the per-pair vectors are all pointing in the same direction,
and small when the per-pair vectors point in random directions due to the absence of a consistent batch effect.
If a large loss of variance is observed along with a small batch effect in a given merge step, users can set min.batch.skip
to simply skip correction in that step.
Aaron Lun
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421
Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html
cosineNorm
and multiBatchPCA
to obtain the values to be corrected.
mnnCorrect
for the “classic” version of the MNN correction algorithm.
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- fastMNN(B1, B2) str(reducedDim(out)) # corrected values # An equivalent approach with PC input. cB1 <- cosineNorm(B1) cB2 <- cosineNorm(B2) pcs <- multiBatchPCA(cB1, cB2) out2 <- fastMNN(pcs[[1]], pcs[[2]], pc.input=TRUE) all.equal(reducedDim(out), out2$corrected) # should be TRUE # Extracting corrected expression values for gene 10. summary(assay(out)[10,])