scDblFinder {scDblFinder} | R Documentation |
Identification of doublets in single-cell RNAseq directly from counts using overclustering-based generation of artificial doublets.
scDblFinder(sce, artificialDoublets = NULL, clusters = NULL,
  clust.method = c("louvain", "overcluster", "fast_greedy"),
  samples = NULL, minClusSize = min(50, ncol(sce)/5), maxClusSize = NULL,
  nfeatures = 1000, dims = 20, dbr = NULL, dbr.sd = 0.015, k = 20,
  clust.graph.type = c("snn", "knn"), fullTable = FALSE,
  verbose = is.null(samples), score = c("weighted", "ratio", "hybrid"),
  BPPARAM = SerialParam())
sce |
|
artificialDoublets |
The approximate number of artificial doublets to create. If NULL, this will be the larger of the number of cells and '5*nbClusters^2'. |
clusters |
The optional cluster assignments (if omitted, will run clustering). This is used to make doublets more efficiently. 'clusters' should either be a vector of labels for each cell, or the name of a colData column of 'sce'. |
clust.method |
The clustering method if 'clusters' is not given. |
samples |
A vector of the same length as cells (or the name of a column of 'colData(sce)'), indicating to which sample each cell belongs. Here, a sample is understood as being processed independently. If omitted, doublets will be searched for with all cells together. If given, doublets will be searched for independently for each sample, which is preferable if they represent different captures. If your samples were multiplexed using cell hashes, what you want to give here are the different batches/wells (i.e. independent captures, since doublets cannot arise across them) rather than biological samples. |
minClusSize |
The minimum cluster size for 'quickCluster'/'overcluster' (default: the smaller of 50 and one fifth of the number of cells); ignored if 'clusters' is given. |
maxClusSize |
The maximum cluster size for 'overcluster'. Ignored if 'clusters' is given. If NA, clustering will be performed using 'quickCluster', otherwise via 'overcluster'. If missing, the default value will be estimated by 'overcluster'. |
nfeatures |
The number of top features to use (default 1000). |
dims |
The number of dimensions used to build the network (default 20). |
dbr |
The expected doublet rate. By default this is assumed to be 1% per thousand cells captured (so 4% among 4000 cells), which is appropriate for 10x datasets. |
dbr.sd |
The standard deviation of the doublet rate, defaults to 0.015. |
k |
Number of nearest neighbors (for KNN graph). |
clust.graph.type |
Either 'snn' or 'knn'. |
fullTable |
Logical; whether to return the full table including artificial doublets, rather than the table for real cells only (default). |
verbose |
Logical; whether to print messages and the thresholding plot. |
score |
Score to use for final classification; either 'weighted' (default), 'ratio' or 'hybrid' (includes information about library size and detection rate). |
BPPARAM |
Used for multithreading when splitting by samples (i.e. when 'samples' is not NULL); otherwise passed to any subsequent PCA and KNN/SNN calculations. |
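The default values described for 'dbr', 'artificialDoublets' and 'minClusSize' are simple arithmetic on the number of cells and clusters. The sketch below only restates those rules in base R so they can be checked by hand; the helper names are hypothetical and are not part of the package, and the package's internal computation may differ in detail.

```r
# Hypothetical helpers that restate the documented defaults; not package code.

# 'dbr': 1% per thousand cells captured.
expected_dbr <- function(n_cells) 0.01 * n_cells / 1000

# 'artificialDoublets': the larger of the cell count and 5 * nbClusters^2.
default_n_artificial <- function(n_cells, n_clusters) {
  max(n_cells, 5 * n_clusters^2)
}

# 'minClusSize': min(50, ncol(sce)/5), as written in the usage line.
default_min_clus_size <- function(n_cells) min(50, n_cells / 5)

expected_dbr(4000)              # 0.04, i.e. 4% among 4000 cells
default_n_artificial(4000, 10)  # 4000 (cell count dominates)
default_n_artificial(1000, 20)  # 2000 (5 * 20^2 dominates)
default_min_clus_size(100)      # 20
```

For small datasets the cluster-size default falls below 50, which is why the usage line caps it by one fifth of the number of cells.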
The 'sce' object with the following additional colData columns: 'scDblFinder.ratio' (ratio of artificial doublets among neighbors), 'scDblFinder.weighted' (the ratio of artificial doublets among neighbors weighted by their distance), 'scDblFinder.score' (the final score used, by default the same as 'scDblFinder.weighted'), and 'scDblFinder.class' (whether the cell is called as 'doublet' or 'singlet'). Alternatively, if 'fullTable=TRUE', a data.frame will be returned with information about real and artificial cells.
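To make the difference between the plain and the distance-weighted ratio concrete, here is a toy base R calculation for a single cell with five neighbors. The inverse-distance weighting is an assumption made purely for illustration; this page does not state which weighting the package uses, and the values are invented.

```r
# Toy neighborhood of one cell (invented values): which of its 5 nearest
# neighbors are artificial doublets, and how far away each neighbor is.
is_artificial <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
distance      <- c(1, 2, 4, 4, 8)

# Plain ratio: fraction of neighbors that are artificial doublets.
ratio <- mean(is_artificial)

# Distance-weighted ratio. ASSUMPTION: inverse-distance weights, used here
# only as an example; the package may weight neighbors differently.
w <- 1 / distance
weighted <- sum(w * is_artificial) / sum(w)

ratio
weighted
```

In this toy case the closest neighbor is an artificial doublet, so the weighted value comes out higher than the plain ratio.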
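library(SingleCellExperiment)
library(scDblFinder)

# Simulate a 50 x 50 count matrix; each row uses a different Poisson mean between 0 and 5
m <- t(sapply(seq(from=0, to=5, length.out=50), FUN=function(x) rpois(50, x)))
sce <- SingleCellExperiment(list(counts=m))

# Run doublet detection and tabulate the singlet/doublet calls
sce <- scDblFinder(sce, verbose=FALSE)
table(sce$scDblFinder.class)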
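For hashed (multiplexed) experiments, the description of 'samples' asks for the capture (batch/well) rather than the biological sample. The base R sketch below shows that choice on invented per-cell metadata; the column names 'capture' and 'biosample' are hypothetical, and the final call is left commented out because it needs a real 'sce' object whose columns match the metadata.

```r
# Hypothetical per-cell metadata for a hashed experiment: two captures
# (wells), each containing cells from three hashed biological samples.
cell_meta <- data.frame(
  capture   = rep(c("well1", "well2"), each = 6),
  biosample = rep(c("A", "B", "C"), times = 4)
)

# Doublets can only form within a capture, so the grouping to pass as
# 'samples' is the capture, not the biological sample.
samples_arg <- as.character(cell_meta$capture)
table(samples_arg)

# With a real SingleCellExperiment 'sce' whose columns match cell_meta:
# sce <- scDblFinder(sce, samples = samples_arg)
```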