1 Introduction

The bluster package provides a few diagnostics for quantitatively examining the cluster output. These diagnostics We will demonstrate on another dataset from the scRNAseq package, clustered with graph-based methods via the clusterRows() generic as described in the previous vignette.

library(scRNAseq)
sce <- GrunPancreasData()

# Quality control to remove bad cells.
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]

# Normalization by library size.
sce <- logNormCounts(sce)

# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction.
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)

# Clustering.
library(bluster)
mat <- reducedDim(sce)
clust.info <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <- clust.info$clusters
table(clusters)

## clusters
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 285 171 161  59 174  49  70 137  69  65  28  23

2 Computing the silhouette width

The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.

The exact silhouette calculation is rather computationally intensive so bluster implements an approximation instead. This is provided in the approxSilhouette() function, which returns the width for each cell and its closest (non-self) cluster. Clusters consisting of cells with lower widths may warrant some more care during interpretation.

sil <- approxSilhouette(mat, clusters)
sil

## DataFrame with 1291 rows and 2 columns
##                width    other
##            <numeric> <factor>
## D2ex_1      0.375510        7
## D2ex_2      0.370596        7
## D2ex_3      0.409271        7
## D2ex_4      0.257069        3
## D2ex_5      0.249884        3
## ...              ...      ...
## D17TGFB_91  0.126608        2
## D17TGFB_92  0.403107        5
## D17TGFB_93  0.384364        7
## D17TGFB_94  0.243933        2
## D17TGFB_95  0.204860        2

boxplot(split(sil$width, clusters))

3 Computing the neighborhood purity

Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., “purities”) as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.

bluster computes purities through the neighborPurity() function. This returns the purity of the neighborhood for each cell and the identity of the cluster with the strongest presence. The choice of k determines the scope of the neighborhood; lower values will yield higher purities, so this may require some fiddling to find a value that distinguishes between clusters.

pure <- neighborPurity(mat, clusters)
pure

## DataFrame with 1291 rows and 2 columns
##               purity  maximum
##            <numeric> <factor>
## D2ex_1      1.000000        1
## D2ex_2      1.000000        1
## D2ex_3      1.000000        1
## D2ex_4      0.951603        5
## D2ex_5      1.000000        5
## ...              ...      ...
## D17TGFB_91  0.959277        6
## D17TGFB_92  1.000000        1
## D17TGFB_93  1.000000        1
## D17TGFB_94  0.986538        6
## D17TGFB_95  0.898817        6

table(Max=pure$maximum, Assigned=clusters)

##     Assigned
## Max    1   2   3   4   5   6   7   8   9  10  11  12
##   1  279   0   0   0   1   0   0   0   0   0   0   0
##   2    0 170   0   0   0  18   0   0   0   0   0   0
##   3    0   0 158   0   6   0   0   0   0   0   0   0
##   4    0   0   0  59   0   0   1   0   0   0   0   0
##   5    0   0   3   0 165   0   0   0   0   0   0   0
##   6    0   0   0   0   0  31   0   0   0   0   0   0
##   7    6   0   0   0   2   0  67   0   0   0   0   0
##   8    0   0   0   0   0   0   0 136   5   0   0   0
##   9    0   0   0   0   0   0   1   1  64   0   0   0
##   10   0   0   0   0   0   0   0   0   0  65   0   0
##   11   0   0   0   0   0   0   1   0   0   0  28   0
##   12   0   1   0   0   0   0   0   0   0   0   0  23

boxplot(split(pure$purity, clusters))

4 Computing graph modularity

For graph-based methods, we can compute the cluster modularity within clusters and between pairs of clusters. Specifically, we examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many cluster_* functions from igraph). We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated. Large off-diagonal entries indicate that the corresponding pair of clusters are closely related.

g <- clust.info$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)

library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
    col=rev(heat.colors(100)))

This may be better visualized with a force-directed layout:

cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1), 
    mode="upper", weighted=TRUE, diag=FALSE)

# Increasing the weight to increase the visibility of the lines.
set.seed(1100101)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
    layout=igraph::layout_with_lgl)

We can also tune the resolution of the clustering post hoc with the mergeCommunities() function. This will iteratively merge the most closely related pair of clusters together until the desired number of clusters is reached. For example, if we wanted to whittle down the number of clusters to 10, we could do:

merged <- mergeCommunities(g, clusters, number=10)
table(merged)

## merged
##   1   2   3   5   6   7   9  10  11  12 
## 285 171 161 174  49 129 206  65  28  23

5 Comparing two clusterings

To compare two clusterings, the pairwiseRand() function computes the adjusted Rand index (ARI). High ARIs indicate that the two clusterings are similar with respect to how they partition the observations, and an ARI of 1 means that the clusterings are identical.

hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")

## [1] 0.4512108

Of course, a single number is not particularly useful, so clusterRand() also provides the capability to break down the ARI into its contributions from each cluster or cluster pair. Specifically, for each cluster or cluster pair in a “reference” clustering (here, clusters), we see whether it is preserved in the “alternative” clustering (here, hclusters). Large values on the diagonal indicate that the reference cluster is recapitulated; large values off the diagonal indicate that the separation between the corresponding pair of clusters is also maintained.

ratio <- pairwiseRand(clusters, hclusters, mode="ratio")

library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

6 Bootstrapping cluster stability

We can use bootstrapping to evaluate the effect of sampling noise on the stability of a clustering procedure. The bootstrapStability() function will return the ARI of the original clusters against those generated from bootstrap replicates, averaged across multiple bootstrap iterations. High values indicate that the clustering is robust to sample noise.

set.seed(1001010)
ari <-bootstrapStability(mat, clusters=clusters, 
    mode="index", BLUSPARAM=NNGraphParam())
ari

## [1] 0.7147913

Advanced users may also set mode="ratio" to obtain a more detailed breakdown of the effect of noise on each cluster (pair).

set.seed(1001010)
ratio <-bootstrapStability(mat, clusters=clusters,
    mode="ratio", BLUSPARAM=NNGraphParam())

library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

Session information

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             bluster_1.0.0              
##  [3] scater_1.18.0               ggplot2_3.3.2              
##  [5] scran_1.18.0                scuttle_1.0.0              
##  [7] scRNAseq_2.3.17             SingleCellExperiment_1.12.0
##  [9] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
## [13] IRanges_2.24.0              S4Vectors_0.28.0           
## [15] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
## [17] matrixStats_0.57.0          BiocStyle_2.18.0           
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_1.4-1             
##   [3] ellipsis_0.3.1                dynamicTreeCut_1.63-1        
##   [5] XVector_0.30.0                BiocNeighbors_1.8.0          
##   [7] farver_2.0.3                  bit64_4.0.5                  
##   [9] RSpectra_0.16-0               interactiveDisplayBase_1.28.0
##  [11] AnnotationDbi_1.52.0          xml2_1.3.2                   
##  [13] sparseMatrixStats_1.2.0       knitr_1.30                   
##  [15] Rsamtools_2.6.0               dbplyr_1.4.4                 
##  [17] uwot_0.1.8                    shiny_1.5.0                  
##  [19] BiocManager_1.30.10           compiler_4.0.3               
##  [21] httr_1.4.2                    dqrng_0.2.1                  
##  [23] assertthat_0.2.1              Matrix_1.2-18                
##  [25] fastmap_1.0.1                 lazyeval_0.2.2               
##  [27] limma_3.46.0                  later_1.1.0.1                
##  [29] BiocSingular_1.6.0            htmltools_0.5.0              
##  [31] prettyunits_1.1.1             tools_4.0.3                  
##  [33] rsvd_1.0.3                    igraph_1.2.6                 
##  [35] gtable_0.3.0                  glue_1.4.2                   
##  [37] GenomeInfoDbData_1.2.4        dplyr_1.0.2                  
##  [39] rappdirs_0.3.1                Rcpp_1.0.5                   
##  [41] vctrs_0.3.4                   Biostrings_2.58.0            
##  [43] ExperimentHub_1.16.0          rtracklayer_1.50.0           
##  [45] DelayedMatrixStats_1.12.0     xfun_0.18                    
##  [47] stringr_1.4.0                 beachmat_2.6.0               
##  [49] mime_0.9                      lifecycle_0.2.0              
##  [51] irlba_2.3.3                   ensembldb_2.14.0             
##  [53] statmod_1.4.35                XML_3.99-0.5                 
##  [55] AnnotationHub_2.22.0          edgeR_3.32.0                 
##  [57] zlibbioc_1.36.0               scales_1.1.1                 
##  [59] hms_0.5.3                     promises_1.1.1               
##  [61] ProtGenerics_1.22.0           RColorBrewer_1.1-2           
##  [63] AnnotationFilter_1.14.0       yaml_2.2.1                   
##  [65] curl_4.3                      gridExtra_2.3                
##  [67] memoise_1.1.0                 biomaRt_2.46.0               
##  [69] stringi_1.5.3                 RSQLite_2.2.1                
##  [71] BiocVersion_3.12.0            GenomicFeatures_1.42.0       
##  [73] BiocParallel_1.24.0           rlang_0.4.8                  
##  [75] pkgconfig_2.0.3               bitops_1.0-6                 
##  [77] evaluate_0.14                 lattice_0.20-41              
##  [79] purrr_0.3.4                   labeling_0.4.2               
##  [81] GenomicAlignments_1.26.0      cowplot_1.1.0                
##  [83] bit_4.0.4                     tidyselect_1.1.0             
##  [85] magrittr_1.5                  bookdown_0.21                
##  [87] R6_2.4.1                      magick_2.5.0                 
##  [89] generics_0.0.2                DelayedArray_0.16.0          
##  [91] DBI_1.1.0                     pillar_1.4.6                 
##  [93] withr_2.3.0                   RCurl_1.98-1.2               
##  [95] tibble_3.0.4                  crayon_1.3.4                 
##  [97] BiocFileCache_1.14.0          rmarkdown_2.5                
##  [99] viridis_0.5.1                 progress_1.2.2               
## [101] locfit_1.5-9.4                grid_4.0.3                   
## [103] FNN_1.1.3                     blob_1.2.1                   
## [105] digest_0.6.27                 xtable_1.8-4                 
## [107] httpuv_1.5.4                  openssl_1.4.3                
## [109] munsell_0.5.0                 viridisLite_0.3.0            
## [111] beeswarm_0.2.3                vipor_0.4.5                  
## [113] askpass_1.1

Assorted clustering diagnostics

27 October 2020

Package

Contents