clusterPurity {scran} | R Documentation |
Use a hypersphere-based approach to compute the “purity” of each cluster based on the number of contaminating cells in its region of the coordinate space.
clusterPurity(x, ...)

## S4 method for signature 'ANY'
clusterPurity(
  x,
  clusters,
  k = 50,
  transposed = FALSE,
  weighted = TRUE,
  subset.row = NULL,
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
clusterPurity(x, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
clusterPurity(
  x,
  clusters = colLabels(x, onAbsence = "error"),
  ...,
  assay.type = "logcounts",
  use.dimred = NULL
)
x |
A numeric matrix-like object containing expression values for genes (rows) and cells (columns).
If Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. For the SingleCellExperiment method, if |
... |
For the generic, arguments to pass to specific methods. For the SummarizedExperiment method, arguments to pass to the ANY method. For the SingleCellExperiment method, arguments to pass to the SummarizedExperiment method. |
clusters |
Vector of length equal to |
k |
Integer scalar specifying the number of nearest neighbors to use to determine the radius of the hyperspheres. |
transposed |
Logical scalar specifying whether |
weighted |
A logical scalar indicating whether to weight each cell in inverse proportion to the size of its cluster.
Alternatively, a numeric vector of length equal to |
subset.row |
See |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm.
This should be an algorithm supported by |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use. |
use.dimred |
A string specifying whether existing values in |
The purity of a cluster is quantified by creating a hypersphere around each cell in the cluster and computing the proportion of cells in that hypersphere from the same cluster. If all cells in a cluster have proportions close to 1, this indicates that the cluster is highly pure, i.e., there are few cells from other clusters in its region of the coordinate space. The distribution of purities for each cluster can be used as a measure of separation from other clusters.
In most cases, the majority of cells in a cluster will have high purities, corresponding to cells close to the cluster center, and a fraction will have low values, corresponding to cells lying at the boundary between two adjacent clusters. A high degree of over-clustering will manifest as a majority of cells with purities close to zero.
The choice of k is used only to determine an appropriate value for the hypersphere radius. We use hyperspheres as this is robust to changes in density throughout the coordinate space, in contrast to computing purity based on the proportion of k-nearest neighbors in the same cluster. For example, the latter will fail most obviously when the size of the cluster is less than k.
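The hypersphere idea described above can be sketched in base R. This is an illustrative sketch only, not scran's implementation: the helper `hypersphere_purity()` is hypothetical, it ignores weighting, and it assumes a single shared radius equal to the median distance to the k-th nearest neighbor (the text only states that k determines the radius, so that specific rule is an assumption).

```r
# Illustrative base-R sketch of hypersphere-based purity (not scran's code).
# 'coords' holds cells in rows; 'clusters' has one label per cell.
hypersphere_purity <- function(coords, clusters, k = 5) {
    d <- as.matrix(dist(coords))  # pairwise Euclidean distances between cells

    # ASSUMPTION: one shared radius, taken as the median distance from each
    # cell to its k-th nearest neighbor (position 1 after sorting is the cell itself).
    kth <- apply(d, 1, function(row) sort(row)[k + 1])
    radius <- median(kth)

    vapply(seq_len(nrow(d)), function(i) {
        inside <- d[i, ] <= radius             # cells in the sphere, including cell i
        mean(clusters[inside] == clusters[i])  # proportion from the same cluster
    }, numeric(1))
}

# Two tight, well-separated groups of 10 cells each on a line.
coords_sep <- matrix(c((0:9) / 10, 100 + (0:9) / 10), ncol = 1)
clusters_sep <- rep(c("A", "B"), each = 10)
purity_sep <- hypersphere_purity(coords_sep, clusters_sep, k = 5)

# The same number of cells, but with the two labels interleaved along the line.
coords_mix <- matrix((0:19) / 10, ncol = 1)
clusters_mix <- rep(c("A", "B"), times = 10)
purity_mix <- hypersphere_purity(coords_mix, clusters_mix, k = 5)

summary(purity_sep)  # every sphere contains only cells of the same cluster
summary(purity_mix)  # every sphere contains cells of the other cluster
```

In the separated case no sphere reaches the other group, so every cell has a purity of 1. In the interleaved case each cell's nearest neighbors carry the other label, so all purities fall below 1.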
A numeric vector of purity values for each cell in x.
By default, purity values are computed after weighting each cell by the reciprocal of the number of cells in the same cluster. Otherwise, clusters with more cells will have higher purities as any contamination is offset by the bulk of cells. Without weighting, this effect would compromise comparisons between clusters. One can interpret the weighted purities as the expected value after downsampling all clusters to the same size.
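The effect of this weighting can be checked with a small worked example. The sketch below is not scran code; it assumes the simplest possible situation, where 90 cells of one cluster and 10 cells of another are fully intermingled so that every hypersphere contains all 100 cells. The cluster names are hypothetical.

```r
# Worked example: two fully intermingled clusters of unequal size,
# assuming every hypersphere contains all 100 cells.
clusters <- rep(c("big", "small"), times = c(90, 10))

# Unweighted purity: the proportion of cells in the sphere from each cluster.
unweighted <- table(clusters) / length(clusters)

# Weighted purity: each cell counts for 1 / (number of cells in its cluster).
w <- 1 / as.numeric(table(clusters)[clusters])
weighted <- tapply(w, clusters, sum) / sum(w)

unweighted  # big = 0.9, small = 0.1
weighted    # 0.5 for both clusters (up to floating-point error)
```

Without weighting, the larger cluster appears much purer purely because of its size. With weighting, both clusters receive the same purity, which matches what one would expect after downsampling them to the same number of cells.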
Advanced users can achieve greater control by manually supplying a numeric vector of weights to weighted. For example, we may wish to check the purity of batches after batch correction. In this application, clusters should be set to the blocking factor (not the cluster identities!) and weighted should be set to 1 over the frequency of each combination of cell type and batch. This accounts for differences in cell type composition between batches when computing purities.
If weighted=FALSE, no weighting is performed.
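One way to build such a weight vector for the batch-correction case is sketched below in base R. The `batch` and `celltype` vectors are hypothetical annotations invented for illustration, and the final `clusterPurity()` call is left as a comment because it needs scran and a matrix `x` of corrected values.

```r
# Hypothetical per-cell annotations, for illustration only.
batch    <- c("b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2", "b2", "b2")
celltype <- c("T",  "T",  "T",  "B",  "T",  "B",  "B",  "B",  "B",  "B")

# Number of cells sharing each cell's (cell type, batch) combination.
n_combo <- ave(rep(1, length(batch)), celltype, batch, FUN = length)

# Weight each cell by 1 over the frequency of its combination.
w <- 1 / n_combo

# Each observed combination now contributes a total weight of 1.
combo_totals <- tapply(w, list(celltype, batch), sum)
combo_totals

# Not run here (requires scran and a matrix 'x' of corrected values):
# purity <- clusterPurity(x, clusters = batch, weighted = w)
```

Because every combination of cell type and batch ends up with the same total weight, a batch that happens to contain more cells of one type does not dominate the purity calculation.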
Aaron Lun
library(scran)   # provides clusterPurity() and buildSNNGraph()
library(scater)

# Mock dataset with log-normalized expression values.
sce <- mockSCE()
sce <- logNormCounts(sce)

# Cluster cells on a shared nearest-neighbor graph.
g <- buildSNNGraph(sce)
clusters <- igraph::cluster_walktrap(g)$membership

# Per-cell purities, summarized by cluster.
out <- clusterPurity(sce, clusters)
boxplot(split(out, clusters))

# Mocking up a stronger example:
ngenes <- 1000
centers <- matrix(rnorm(ngenes*3), ncol=3)
clusters <- sample(1:3, ncol(sce), replace=TRUE)
y <- centers[,clusters]
y <- y + rnorm(length(y))

out2 <- clusterPurity(y, clusters)
boxplot(split(out2, clusters))