StabMap 1.0.0
library(StabMap)
library(magrittr)
library(scater)
library(scran)
library(SingleCellMultiModal)
library(gridExtra)
set.seed(2024)
StabMap is a technique for performing mosaic single cell data integration. Mosaic data integration presents the challenge of integration of data where only some features or cells are shared across datasets. For example, these challenges arise when integrating single-cell datasets that measure different molecular profiles, such as chromatin accessibility or RNA expression assays. Integrative analysis of such data may provide a more in-depth profile of each cell, facilitating downstream analysis. To read more about StabMap please see our paper on Nature Biotechnology.
In this vignette we will elaborate on how mosaic single cell data integration
is implemented in the StabMap
package. We address a few key goals:
Mosaic Data integration for 2 datasets
Demonstrating cell imputation following integration
Indirect mosaic data integration for 3 datasets, including 2 non-overlapping datasets
In this tutorial we will work with a multi-assay single cell dataset, consisting of ATAC and gene expression data for 10,032 cells.
mae <- scMultiome(
"pbmc_10x",
mode = "*", dry.run = FALSE, format = "MTX", verbose = TRUE
)
Perform some exploration of this data.
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] atac: SingleCellExperiment with 108344 rows and 10032 columns
## [2] rna: SingleCellExperiment with 36549 rows and 10032 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
upsetSamples(mae)
head(colData(mae))
## DataFrame with 6 rows and 6 columns
## nCount_RNA nFeature_RNA nCount_ATAC nFeature_ATAC
## <integer> <integer> <integer> <integer>
## AAACAGCCAAGGAATC 8380 3308 55582 13878
## AAACAGCCAATCCCTT 3771 1896 20495 7253
## AAACAGCCAATGCGCT 6876 2904 16674 6528
## AAACAGCCAGTAGGTG 7614 3061 39454 11633
## AAACAGCCAGTTTACG 3633 1691 20523 7245
## AAACAGCCATCCAGGT 7782 3028 22412 8602
## celltype broad_celltype
## <character> <character>
## AAACAGCCAAGGAATC naive CD4 T cells Lymphoid
## AAACAGCCAATCCCTT memory CD4 T cells Lymphoid
## AAACAGCCAATGCGCT naive CD4 T cells Lymphoid
## AAACAGCCAGTAGGTG naive CD4 T cells Lymphoid
## AAACAGCCAGTTTACG memory CD4 T cells Lymphoid
## AAACAGCCATCCAGGT non-classical monocy.. Myeloid
dim(experiments(mae)[["rna"]])
## [1] 36549 10032
names(experiments(mae))
## [1] "atac" "rna"
Keep the first 2,000 cells only. Normalise and select variable features for the RNA modality.
sce.rna <- experiments(mae)[["rna"]]
# Normalisation
sce.rna <- logNormCounts(sce.rna)
# Feature selection
decomp <- modelGeneVar(sce.rna)
hvgs <- rownames(decomp)[decomp$mean > 0.01 & decomp$p.value <= 0.05]
length(hvgs)
## [1] 952
sce.rna <- sce.rna[hvgs, ]
Keep the first 2,000 cells only. Normalise and select variable features for the ATAC modality.
dim(experiments(mae)[["atac"]])
## [1] 108344 10032
sce.atac <- experiments(mae)[["atac"]]
# Normalise
sce.atac <- logNormCounts(sce.atac)
# Feature selection using highly variable peaks
# And adding matching peaks to genes
decomp <- modelGeneVar(sce.atac)
hvgs <- rownames(decomp)[decomp$mean > 0.25 &
decomp$p.value <= 0.05]
length(hvgs)
## [1] 788
sce.atac <- sce.atac[hvgs, ]
Create a composite full data matrix by concatenating.
logcounts_all <- rbind(logcounts(sce.rna), logcounts(sce.atac))
dim(logcounts_all)
## [1] 1740 10032
assayType <- ifelse(rownames(logcounts_all) %in% rownames(sce.rna),
"rna", "atac"
)
table(assayType)
## assayType
## atac rna
## 788 952
We will simulate a situation where half of the cells correspond to the Multiome (RNA + ATAC features) modality, and half of the cells correspond to the RNA modality. Our goal is to then integrate both datasets by generating a joint embedding of the cells using all data, and to impute the missing ATAC cell values from the RNA modality cells.
dataType <- setNames(
sample(c("RNA", "Multiome"), ncol(logcounts_all),
prob = c(0.5, 0.5), replace = TRUE
),
colnames(logcounts_all)
)
table(dataType)
## dataType
## Multiome RNA
## 5025 5007
assay_list <- list(
RNA = logcounts_all[assayType %in% c("rna"), dataType %in% c("RNA")],
Multiome = logcounts_all[
assayType %in% c("rna", "atac"), dataType %in% c("Multiome")
]
)
lapply(assay_list, dim)
## $RNA
## [1] 952 5007
##
## $Multiome
## [1] 1740 5025
lapply(assay_list, class)
## $RNA
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
##
## $Multiome
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Examine the shared features between the two datasets using mosaicDataUpSet()
.
mosaicDataUpSet(assay_list, plot = FALSE)
From this we note that there are shared features between the RNA and Multiome datasets, but there are many features that are observed only in the Multiome dataset and not the RNA - as we had constructed.
We can understand the mosaicDataTopology()
of these datasets, which
generates an igraph
object, which can be inspected and plotted.
The mosaicDataTopology()
is a weighted network where nodes represent each
dataset, and edges connect nodes with at least one overlapping feature.
mdt <- mosaicDataTopology(assay_list)
mdt
## IGRAPH 957cd84 UN-- 2 1 --
## + attr: name (v/c), frame.color (v/c), color (v/c), label.color (v/c),
## | label.family (v/c)
## + edge from 957cd84 (vertex names):
## [1] RNA--Multiome
plot(mdt)
From this we note that the datasets RNA and Multiome share at least some features. StabMap requires that the mosaic data topology network be connected, that is, that there should be a path between every pair of nodes in the network.
We now aim to integrate the data from the RNA and Multiome modality by generating
a common joint embedding for these data using stabMap()
. The stabMap()
integration
approach aims to stabilize integration of single-cell data by exploting the
non-overlapping features, so that cells with similar biological profiles will
cluster. Stabilisation using non-overlapping features may be important
when there are limited overlapping features or when the informative features are
unknown.
What is
stabMap
doing?stabMap
generates a joint embedding using 3 steps:
Identify the mosaicDataTopology()
Embed the reference dataset into a lower dimensional space
Project cells from non-reference datasets onto the reference dataset embedding
by using a model to traverse shortest paths in the mosaicDataTopology()
Since the Multiome data contains all features, we treat this as the reference dataset.
Since we already examined the mosaic data topology, we set plot = FALSE
.
stab <- stabMap(assay_list,
reference_list = c("Multiome"),
plot = FALSE
)
## treating "Multiome" as reference
## generating embedding for path with reference "Multiome": "Multiome"
## generating embedding for path with reference "Multiome": "RNA" -> "Multiome"
dim(stab)
## [1] 10032 50
stab[1:5, 1:5]
## Multiome_PC1 Multiome_PC2 Multiome_PC3 Multiome_PC4
## AAACAGCCAATCCCTT 12.885344 -3.075968 -1.723863 -0.3561525
## AAACAGCCAGTTTACG 11.314093 -2.344855 2.608507 1.2228681
## AAACATGCAAGGTCCT 13.821325 -3.100703 4.755135 -0.6836924
## AAACATGCACCGGCTA 6.287519 -2.080285 -24.802926 -0.6373922
## AAACATGCAGCAAGTG 12.500354 -3.058831 5.358400 -2.6757611
## Multiome_PC5
## AAACAGCCAATCCCTT -4.6468061
## AAACAGCCAGTTTACG -8.5576292
## AAACATGCAAGGTCCT 6.0538837
## AAACATGCACCGGCTA 7.1583625
## AAACATGCAGCAAGTG -0.1806992
We can reduce the dimension further using non-linear approaches such as UMAP.
stab_umap <- calculateUMAP(t(stab))
dim(stab_umap)
## [1] 10032 2
plot(stab_umap, pch = 16, cex = 0.3, col = factor(dataType[rownames(stab)]))