buildSNNGraph {scran} | R Documentation |
Build a shared or k-nearest-neighbors graph for cells based on their expression profiles.
buildSNNGraph(x, ...) ## S4 method for signature 'ANY' buildSNNGraph(x, k = 10, d = 50, type = c("rank", "number", "jaccard"), transposed = FALSE, subset.row = NULL, BNPARAM = KmknnParam(), BSPARAM = bsparam(), BPPARAM = SerialParam()) ## S4 method for signature 'SingleCellExperiment' buildSNNGraph(x, ..., subset.row = NULL, assay.type = "logcounts", get.spikes = FALSE, use.dimred = NULL) buildKNNGraph(x, ...) ## S4 method for signature 'ANY' buildKNNGraph(x, k = 10, d = 50, directed = FALSE, transposed = FALSE, subset.row = NULL, BNPARAM = KmknnParam(), BSPARAM = bsparam(), BPPARAM = SerialParam()) ## S4 method for signature 'SingleCellExperiment' buildKNNGraph(x, ..., subset.row = NULL, assay.type = "logcounts", get.spikes = FALSE, use.dimred = NULL) neighborsToSNNGraph(indices, type = c("rank", "number", "jaccard")) neighborsToKNNGraph(indices, directed = FALSE)
x |
For the ANY method, a matrix-like object containing expression values for each gene (row) in each cell (column).
These dimensions can be transposed if For the SingleCellExperiment method, a SingleCellExperiment containing such an expression matrix.
Alternatively, graph building will be performed from its |
... |
For the generics, additional arguments to pass to the specific methods. For the SingleCellExperiment methods, additional arguments to pass to the corresponding ANY method. |
k |
An integer scalar specifying the number of nearest neighbors to consider during graph construction. |
d |
An integer scalar specifying the number of dimensions to use for the search. |
type |
A string specifying the type of weighting scheme to use for shared neighbors. |
transposed |
A logical scalar indicating whether |
subset.row |
See |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA, if |
BPPARAM |
A BiocParallelParam object to use for parallel processing. |
assay.type |
A string specifying which assay values to use. |
get.spikes |
See |
use.dimred |
A string specifying whether existing values in |
directed |
A logical scalar indicating whether the output of |
indices |
An integer matrix where each row corresponds to a cell and contains the indices of the |
The buildSNNGraph
method builds a shared nearest-neighbour graph using cells as nodes.
For each cell, its k
nearest neighbours are identified using the findKNN
function,
based on distances between their expression profiles (Euclidean by default).
An edge is drawn between all pairs of cells that share at least one neighbour,
weighted by the characteristics of the shared nearest neighbors:
If type="rank"
, the weighting scheme defined by Xu and Su (2015) is used.
The weight between two nodes is k - r/2 where r is the smallest sum of ranks for any shared neighboring node.
For example, if one node was the closest neighbor of each of two nodes, the weight between the two latter nodes would be k - 1.
For the purposes of this ranking, each node has a rank of zero in its own nearest-neighbor set.
More shared neighbors, or shared neighbors that are close to both cells, will generally yield larger weights.
If type="number"
, the weight between two nodes is simply the number of shared nearest neighbors between them.
The weight can range from zero to k + 1, as the node itself is included in its own nearest-neighbor set.
This is a simpler scheme that is also slightly faster but does not account for the ranking of neighbors within each set.
If type="jaccard"
, the weight between two nodes is the Jaccard similarity between the two sets of neighbors for those nodes.
This weight can range from zero to 1, and is a monotonic transformation of the weight used by type="number"
.
It is provided for consistency with other clustering algorithms such as those in seurat.
The aim is to use the SNN graph to perform clustering of cells via community detection algorithms in the igraph package. This is faster and more memory efficient than hierarchical clustering for large numbers of cells. In particular, it avoids the need to construct a distance matrix for all pairs of cells. Only the identities of nearest neighbours are required, which can be obtained quickly with methods in the BiocNeighbors package.
The choice of k
controls the connectivity of the graph and the resolution of community detection algorithms.
Smaller values of k
will generally yield smaller, finer clusters, while increasing k
will increase the connectivity of the graph and make it more difficult to resolve different communities.
The value of k
can be roughly interpreted as the anticipated size of the smallest subpopulation.
If a subpopulation in the data has fewer than k+1
cells, buildSNNGraph
and buildKNNGraph
will forcibly construct edges between cells in that subpopulation and cells in other subpopulations.
This increases the risk that the subpopulation will not form its own cluster as it is more interconnected with the rest of the cells in the dataset.
Note that the setting of k
here is slightly different from that used in SNN-Cliq.
The original implementation considers each cell to be its first nearest neighbor that contributes to k
.
In buildSNNGraph
, the k
nearest neighbours refers to the number of other cells.
The buildKNNGraph
method builds a simpler k-nearest neighbour graph.
Cells are again nodes, and edges are drawn between each cell and its k-nearest neighbours.
No weighting of the edges is performed.
In theory, these graphs are directed as nearest neighour relationships may not be reciprocal.
However, by default, directed=FALSE
such that an undirected graph is returned.
A graph where nodes are cells and edges represent connections between nearest neighbors.
For buildSNNGraph
, these edges are weighted by the number of shared nearest neighbors.
For buildKNNGraph
, edges are not weighted but may be directed if directed=TRUE
.
In practice, PCA is performed on a matrix-like x
to obtain the first d
principal components.
The k-NN search can then be rapidly performed on the PCs rather than on the full expression matrix.
By default, the first 50 components are chosen, which should retain most of the substructure in the data set.
If d
is NA
or greater than or equal to the number of cells, no dimensionality reduction is performed.
The PCA is performed using methods the runSVD
function from the BiocSingular package.
To improve speed, this can be done using approximate algorithms by modifying BSPARAM
, e.g., to IrlbaParam()
.
Approximate algorithms will converge towards the correct result but often involve some random initialization and thus are technically dependent on the session seed.
For full reproducibility, users are advised to call set.seed
beforehand when using this option.
Expression values in x
should typically be on the log-scale, e.g., log-transformed counts.
Ranks can also be used for greater robustness, e.g., from scaledColRanks
.
(Dimensionality reduction is still okay when ranks are provided - running PCA on ranks is equivalent to running MDS on the distance matrix derived from Spearman's rho.)
If the input matrix x
is already transposed for the ANY method, transposed=TRUE
avoids an unnecessary internal transposition.
A typical use case is when x
contains some reduced dimension coordinates with cells in the rows.
In such cases, setting transposed=TRUE
and d=NA
will use the input coordinates directly for graph-building.
The same principles apply when x
is a SingleCellExperiment object,
except that the relevant matrix is now retrieved from the assays using assay.type
.
If use.dimred
is not NULL
, existing PCs are used from the specified entry of reducedDims(x)
,
and any setting of d
, subset.row
and get.spikes
are ignored.
The neighborsToSNNGraph
and neighborsToKNNGraph
functions operate directly on a matrix of nearest neighbor indices,
obtained using functions like findKNN
.
This may be useful for constructing a graph from precomputed nearest-neighbor search results.
Note that the user is responsible for ensuring that the indices are valid (i.e., range(indices)
is positive and no greater than max(indices)
).
Aaron Lun, with KNN code contributed by Jonathan Griffiths.
Xu C and Su Z (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31:1974-80
See make_graph
for details on the graph output object.
See cluster_walktrap
, cluster_louvain
and related functions in igraph for clustering based on the produced graph.
Also see findKNN
for specifics of the nearest-neighbor search.
library(scater) sce <- mockSCE(ncells=500) sce <- logNormCounts(sce) g <- buildSNNGraph(sce) clusters <- igraph::cluster_fast_greedy(g)$membership table(clusters) # Any clustering method from igraph can be used: clusters <- igraph::cluster_walktrap(g)$membership table(clusters) # Smaller 'k' usually yields finer clusters: g <- buildSNNGraph(sce, k=5) clusters <- igraph::cluster_walktrap(g)$membership table(clusters) # Graph can be built off existing reducedDims results: sce <- runPCA(sce) g <- buildSNNGraph(sce, use.dimred="PCA") clusters <- igraph::cluster_fast_greedy(g)$membership table(clusters)