BayesSpace supports three ways of loading a
SingleCellExperiment for analysis.
Visium datasets processed with Space Ranger can be loaded directly via the
readVisium() function. This function takes only the path to the Space Ranger output directory (containing the
filtered_feature_bc_matrix/ subdirectories) and returns a
Second, all datasets analyzed for the BayesSpace manuscript are readily accessible via the
getRDS() function. This function takes two arguments - the name of the dataset, and the name of the sample in the dataset.
SingleCellExperiment objects can be constructed manually from a counts matrix and tables of row and column data. BayesSpace only requires that spot array coordinates be provided as columns named
colData. (Note that enhancement of Visium datasets additionally requires the pixel coordinates of each spot in the tissue image, but in this case the dataset should be loaded with
readVisium(), which loads these data automatically.)
library(Matrix) rowData <- read.csv("path/to/rowData.csv", stringsAsFactors=FALSE) colData <- read.csv("path/to/colData.csv", stringsAsFactors=FALSE, row.names=1) counts <- read.csv("path/to/counts.csv.gz", row.names=1, check.names=F, stringsAsFactors=FALSE)) sce <- SingleCellExperiment(assays=list(counts=as(counts, "dgCMatrix")), rowData=rowData, colData=colData)
We’ll continue with the melanoma sample from the 2018 Spatial Transcriptomics paper for the remaining examples in this vignette.
BayesSpace requires minimal data pre-processing, but we provide a helper function to automate it.
spatialPreprocess() log-normalizes the count matrix and performs PCA on the top
n.HVGs highly variable genes, keeping the top
n.PCs principal components. Additionally, the spatial sequencing platform is added as metadata in the
SingleCellExperiment for downstream analyses. If you do not wish to rerun PCA, running
spatialPreprocess() with the flag
skip.PCA=TRUE will only add the metadata BayesSpace requires.
Here, we omit log-normalization as all datasets available through
getRDS() already include log-normalized counts.
We can use the
qPlot() functions to help choose
q, the number of clusters to use in our analysis.
qTune()runs the BayesSpace clustering algorithm for multiple specified values of
q(by default, 3 through 7) and computes their average pseudo-log-likelihood. It accepts any arguments to
qPlot()plots the pseudo-log-likelihood as a function of
q; we suggest choosing a
qaround the elbow of this plot.
spatialCluster() function clusters the spots, and adds the predicted cluster labels to the
SingleCellExperiment. Typically, as we did for the analyses in the paper, we suggest running with at least 10,000 iterations (
nrep=10000), but we use 1,000 iteration in this demonstration for the sake of runtime. (Note that a random seed must be set in order for the results to be reproducible.)
Both the mclust initialization (
cluster.init) and the BayesSpace cluster assignments (
spatial.cluster) are now available in the SingleCellExperiment’s
head(colData(melanoma)) #> DataFrame with 6 rows and 5 columns #> row col sizeFactor cluster.init spatial.cluster #> <integer> <integer> <numeric> <numeric> <numeric> #> 7x15 7 15 0.795588 1 1 #> 7x16 7 16 0.307304 1 1 #> 7x17 7 17 0.331247 2 2 #> 7x18 7 18 0.420747 3 2 #> 8x13 8 13 0.255453 1 1 #> 8x14 8 14 1.473439 1 1
We can plot the cluster assignments over the spatial locations of the spots with
clusterPlot() returns a
ggplot object, it can be customized by composing with familiar
ggplot2 functions. Additionally, the argument
palette sets the colors used for each cluster, and
clusterPlot() takes additional arguments to
geom_polygon() such as
color to control the aesthetics of the spot borders.
spatialEnhance() function will enhance the resolution of the principal components, and add these PCs as well as predicted cluster labels at subspot resolution to a new
SingleCellExperiment. As with our demonstration of
spatialCluster() above, we are using fewer iterations for the purpose of this example (
nrep=1000) than we recommend in practice (
nrep=100000 or greater). Note that the
jitter_scale parameter should be tuned so that proposals for updating subspot-level expression are accepted around 30% of the time. This can be evaluated using
mcmcChain(melanoma.enhanced, "Ychange"), where the chain should stabilize to 0.25-0.40. Typically 1000-2500 iterations are sufficient to evaluate if
jitter_scale should be increased if acceptance is too high or decreased if acceptance is too low. After tuning, proceed to a full run of
spatialEnhance with more iterations.
SingleCellExperiment includes an index to the parent spot in the original
spot.idx), along with an index to the subspot. It adds the offsets to the original spot coordinates, and provides the enhanced cluster label (
head(colData(melanoma.enhanced)) #> DataFrame with 6 rows and 9 columns #> spot.idx subspot.idx spot.row spot.col row col #> <numeric> <integer> <integer> <integer> <numeric> <numeric> #> subspot_1.1 1 1 7 15 7.33333 15.3333 #> subspot_2.1 2 1 7 16 7.33333 16.3333 #> subspot_3.1 3 1 7 17 7.33333 17.3333 #> subspot_4.1 4 1 7 18 7.33333 18.3333 #> subspot_5.1 5 1 8 13 8.33333 13.3333 #> subspot_6.1 6 1 8 14 8.33333 14.3333 #> imagerow imagecol spatial.cluster #> <numeric> <numeric> <numeric> #> subspot_1.1 7.33333 15.3333 1 #> subspot_2.1 7.33333 16.3333 2 #> subspot_3.1 7.33333 17.3333 1 #> subspot_4.1 7.33333 18.3333 2 #> subspot_5.1 8.33333 13.3333 1 #> subspot_6.1 8.33333 14.3333 1
We can plot the enhanced cluster assignments as above.
BayesSpace operates on the principal components of the gene expression matrix, and
spatialEnhance() therefore computes enhanced resolution PC vectors. Enhanced gene expression is not computed directly, and is instead imputed using a regression algorithm. For each gene, a model using the PC vectors of each spot is trained to predict the spot-level gene expression, and the fitted model is used to predict subspot expression from the subspot PCs.
Gene expression enhancement is implemented in the
enhanceFeatures() function. BayesSpace predicts expression with
xgboost by default, but linear and Dirichlet regression are also available via the
model argument. When using
xgboost, we suggest automatically tuning the
nrounds parameter by setting it to 0, although this comes at the cost of increased runtime (~4x slower than a pre-specified
nrounds in practice).
enhanceFeatures() can be used to impute subspot-level expression for all genes, or for a subset of genes of interest. Here, we’ll demonstrate by enhancing the expression of four marker genes: PMEL (melanoma), CD2 (T-cells), CD19 (B-cells), and COL1A1 (fibroblasts).
By default, log-normalized expression (
logcounts(sce)) is imputed, although other assays or arbitrary feature matrices can be specified.
logcounts(melanoma.enhanced)[markers, 1:5] #> subspot_1.1 subspot_2.1 subspot_3.1 subspot_4.1 subspot_5.1 #> PMEL 2.3572879 1.2667019 2.2696812 2.0454569 2.6437354 #> CD2 0.4325081 0.6166353 0.3151001 0.2315192 0.2875694 #> CD19 0.6170074 0.6239508 0.4075240 0.4075240 0.6170074 #> COL1A1 0.1031096 2.9217663 1.2033939 1.0364592 0.1780530
Diagnostic measures from each predictive model, such as
rmse when using
xgboost, are added to the
rowData of the enhanced dataset.
rowData(melanoma.enhanced)[markers, ] #> DataFrame with 4 rows and 4 columns #> gene_id gene_name is.HVG enhanceFeatures.rmse #> <character> <character> <logical> <numeric> #> PMEL ENSG00000185664 PMEL TRUE 0.804628 #> CD2 ENSG00000116824 CD2 TRUE 0.614575 #> CD19 ENSG00000177455 CD19 TRUE 0.697328 #> COL1A1 ENSG00000108821 COL1A1 TRUE 0.704845
Spatial gene expression is visualized with
Here, we compare the spatial expression of the imputed marker genes.
And we can compare to the spot-level expression.
save.chain is set to
TRUE in either
spatialEnhance(), the chain associated with the respective MCMC run is preserved to disk as an HDF5 file. The path to this file is stored in the SingleCellExperiment’s metadata at
metadata(sce)$h5.chain, and can be read directly using
The chain is provided as a
coda::mcmc object, which can be analyzed with TidyBayes or as a matrix. The object has one row per iteration, with the values of the parameters concatenated across the row. Columns are named with the parameter name and index (if any).
chain <- mcmcChain(melanoma) chain[1:5, 1:5] #> lambda[1,1] lambda[1,2] lambda[1,3] lambda[1,4] lambda[1,5] #> [1,] 0.01000000 0.000000000 0.00000000 0.00000000 0.000000000 #> [2,] 0.08757829 -0.005798093 0.04594761 0.05789615 -0.019023472 #> [3,] 0.10063323 -0.008922653 0.05061634 0.05269334 -0.004028245 #> [4,] 0.13708014 -0.016090349 0.06626887 0.02779234 -0.022475194 #> [5,] 0.12768109 -0.006919337 0.06831921 0.03047754 -0.027188932
To remove the HDF5 file from disk and remove its path from the metadata, use