1 Introduction

smoothclust is a method for segmentation of spatial domains and spatially-aware clustering in spatial transcriptomics data. The method generates spatial domains with smooth boundaries by smoothing gene expression profiles across neighboring spatial locations, followed by unsupervised clustering. Spatial domains consisting of consistent mixtures of cell types may then be further investigated by applying cell type compositional analyses or differential analyses.

2 Installation

The smoothclust package can be installed from Bioconductor as follows (using R version 4.4 and Bioconductor version 3.19, available from approximately 2024-05-01 onwards). Additional details are shown on the Bioconductor package landing page.

install.packages("BiocManager")
BiocManager::install("smoothclust")

The latest development version of the package can be installed from the devel version of Bioconductor or from GitHub.

3 Input data format

Input data can be provided either as a SpatialExperiment object within the Bioconductor framework, or as numeric matrices of expression values and spatial coordinates. See help file (?smoothclust) for details.

In the example workflow below, we assume the input is in SpatialExperiment format.

4 Tutorial

The example workflow in this section demonstrates how to run smoothclust to generate spatial domains with smooth boundaries for a dataset from the 10x Genomics Visium platform.

library(smoothclust)
library(STexampleData)
library(scuttle)
library(scran)
library(scater)
library(ggspavis)

Load dataset from STexampleData package:

# load data
spe <- Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache
# keep spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]

dim(spe)
## [1] 33538  3639
assayNames(spe)
## [1] "counts"

Run smoothclust using default parameter settings, which have been selected to be appropriate for Visium data from human tissue.

The method for smoothing can be specified by providing the method argument, with available options uniform, kernel, and knn. Additional arguments can be used to set parameter values, including bandwidth, k, and truncate, depending on the choice of method. For more details, see the function documentation (?smoothclust) or the paper (in preparation).

# run smoothclust
spe <- smoothclust(spe)

The smoothed expression counts are stored in a new assay named counts_smooth:

# check output object
assayNames(spe)
## [1] "counts"        "counts_smooth"

Calculate log-transformed normalized counts (logcounts) on the smoothed expression counts. Here, the argument assay.type = "counts_smooth" specifies that we want to calculate logcounts using the smoothed counts from smoothclust.

# calculate logcounts
spe <- logNormCounts(spe, assay.type = "counts_smooth")

assayNames(spe)
## [1] "counts"        "counts_smooth" "logcounts"

Run clustering. We use a standard clustering workflow from single-cell data, consisting of k-means clustering on the top principal components (PCs) calculated on the set of top highly variable genes (HVGs) with logcounts as the input.

We use a relatively small number of clusters for demonstration purposes in this example:

# preprocessing steps for clustering

# remove mitochondrial genes
is_mito <- grepl("(^mt-)", rowData(spe)$gene_name, ignore.case = TRUE)
table(is_mito)
## is_mito
## FALSE  TRUE 
## 33525    13
spe <- spe[!is_mito, ]
dim(spe)
## [1] 33525  3639
# select top highly variable genes (HVGs)
dec <- modelGeneVar(spe)
top_hvgs <- getTopHVGs(dec, prop = 0.1)
length(top_hvgs)
## [1] 986
spe <- spe[top_hvgs, ]
dim(spe)
## [1]  986 3639
# dimensionality reduction

# compute PCA on top HVGs
set.seed(123)
spe <- runPCA(spe)
# run k-means clustering
set.seed(123)
k <- 5
clust <- kmeans(reducedDim(spe, "PCA"), centers = k)$cluster
table(clust)
## clust
##    1    2    3    4    5 
##  967 1574  492  272  334
colLabels(spe) <- factor(clust)

Plot clusters / spatial domains generated by the smoothclust workflow:

# color palettes
pal8 <- "libd_layer_colors"
pal36 <- unname(palette.colors(36, "Polychrome 36"))

# plot clusters / spatial domains
plotSpots(spe, annotate = "label", pal = pal8)

Plot manually annotated reference labels, which can be used to evaluate the performance of the clustering in this dataset:

# plot reference labels
plotSpots(spe, annotate = "ground_truth", pal = pal8)

5 Session information

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggspavis_1.9.10             scater_1.31.2              
##  [3] ggplot2_3.5.0               scran_1.31.3               
##  [5] scuttle_1.13.1              STexampleData_1.11.3       
##  [7] SpatialExperiment_1.13.2    SingleCellExperiment_1.25.1
##  [9] SummarizedExperiment_1.33.3 Biobase_2.63.1             
## [11] GenomicRanges_1.55.4        GenomeInfoDb_1.39.14       
## [13] IRanges_2.37.1              S4Vectors_0.41.6           
## [15] MatrixGenerics_1.15.1       matrixStats_1.3.0          
## [17] ExperimentHub_2.11.3        AnnotationHub_3.11.5       
## [19] BiocFileCache_2.11.2        dbplyr_2.5.0               
## [21] BiocGenerics_0.49.1         smoothclust_0.99.8         
## [23] BiocStyle_2.31.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        jsonlite_1.8.8           
##   [3] wk_0.9.1                  magrittr_2.0.3           
##   [5] ggbeeswarm_0.7.2          magick_2.8.3             
##   [7] farver_2.1.1              rmarkdown_2.26           
##   [9] zlibbioc_1.49.3           vctrs_0.6.5              
##  [11] spdep_1.3-3               memoise_2.0.1            
##  [13] DelayedMatrixStats_1.25.3 tinytex_0.50             
##  [15] htmltools_0.5.8.1         S4Arrays_1.3.7           
##  [17] curl_5.2.1                BiocNeighbors_1.21.2     
##  [19] s2_1.1.6                  SparseArray_1.3.5        
##  [21] sass_0.4.9                spData_2.3.0             
##  [23] KernSmooth_2.23-22        bslib_0.7.0              
##  [25] cachem_1.0.8              igraph_2.0.3             
##  [27] mime_0.12                 ggside_0.3.1             
##  [29] lifecycle_1.0.4           pkgconfig_2.0.3          
##  [31] rsvd_1.0.5                Matrix_1.7-0             
##  [33] R6_2.5.1                  fastmap_1.1.1            
##  [35] GenomeInfoDbData_1.2.12   digest_0.6.35            
##  [37] colorspace_2.1-0          AnnotationDbi_1.65.2     
##  [39] dqrng_0.3.2               irlba_2.3.5.1            
##  [41] RSQLite_2.3.6             beachmat_2.19.4          
##  [43] labeling_0.4.3            filelock_1.0.3           
##  [45] fansi_1.0.6               httr_1.4.7               
##  [47] abind_1.4-5               compiler_4.4.0           
##  [49] proxy_0.4-27              bit64_4.0.5              
##  [51] withr_3.0.0               BiocParallel_1.37.1      
##  [53] viridis_0.6.5             DBI_1.2.2                
##  [55] highr_0.10                rappdirs_0.3.3           
##  [57] DelayedArray_0.29.9       rjson_0.2.21             
##  [59] classInt_0.4-10           bluster_1.13.0           
##  [61] tools_4.4.0               units_0.8-5              
##  [63] vipor_0.4.7               beeswarm_0.4.0           
##  [65] glue_1.7.0                dbscan_1.1-12            
##  [67] grid_4.4.0                sf_1.0-16                
##  [69] cluster_2.1.6             generics_0.1.3           
##  [71] gtable_0.3.4              class_7.3-22             
##  [73] BiocSingular_1.19.0       ScaledMatrix_1.11.1      
##  [75] metapod_1.11.1            sp_2.1-3                 
##  [77] utf8_1.2.4                XVector_0.43.1           
##  [79] ggrepel_0.9.5             BiocVersion_3.19.1       
##  [81] pillar_1.9.0              limma_3.59.8             
##  [83] dplyr_1.1.4               lattice_0.22-6           
##  [85] bit_4.0.5                 deldir_2.0-4             
##  [87] tidyselect_1.2.1          locfit_1.5-9.9           
##  [89] Biostrings_2.71.5         knitr_1.46               
##  [91] gridExtra_2.3             bookdown_0.39            
##  [93] edgeR_4.1.23              xfun_0.43                
##  [95] statmod_1.5.0             UCSC.utils_0.99.7        
##  [97] yaml_2.3.8                boot_1.3-30              
##  [99] evaluate_0.23             codetools_0.2-20         
## [101] tibble_3.2.1              BiocManager_1.30.22      
## [103] cli_3.6.2                 munsell_0.5.1            
## [105] jquerylib_0.1.4           Rcpp_1.0.12              
## [107] png_0.1-8                 parallel_4.4.0           
## [109] blob_1.2.4                sparseMatrixStats_1.15.1 
## [111] viridisLite_0.4.2         scales_1.3.0             
## [113] e1071_1.7-14              purrr_1.0.2              
## [115] crayon_1.5.2              rlang_1.1.3              
## [117] KEGGREST_1.43.0