speckle 1.4.0
The speckle package contains functions to analyse differences in cell type proportions in single cell RNA-seq data. As our research into specialised analyses of single cell data continues we anticipate that the package will be updated with new functions.
The propeller method has now been published in Bioinformatics:
Belinda Phipson, Choon Boon Sim, Enzo R Porrello, Alex W Hewitt, Joseph Powell, Alicia Oshlack, propeller: testing for differences in cell type proportions in single cell data, Bioinformatics, 2022;, btac582, https://doi.org/10.1093/bioinformatics/btac582
The analysis of single cell RNA-seq data consists of a large number of steps, which can be iterative and also depend on the research question. There are many R packages that can do some or most of these steps. The analysis steps are described here briefly.
Once the sequencing data has been summarised into counts over genes, quality control is performed to remove poor quality cells. Poor quality cells are often characterised as having very low total counts (library size) and very few genes detected. Lowly expressed and uninformative genes are filtered out, followed by appropriate normalisation. Dimensionality reduction and clustering of the cells is then performed. Cells that have similar transcriptional profiles cluster together, and these clusters (hopefully) correspond to something biologically relevant, such as different cell types. Differential expression between each cluster compared to all other clusters can highlight genes that are more highly expressed in each cluster. These marker genes help to determine the cell type each cluster corresponds to. Cell type identification is a process that often uses marker genes as well as a list of curated genes that are known to be expressed in each cell type. It is always helpful to visualise the data in a lot of different ways to aid in interpretation of the clusters using tSNE/UMAP plots, clustering trees and heatmaps of known marker genes. An alternative to clustering is classification or label transfer approaches, where reference datasets can be used to annotate new datasets.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("speckle")
In order to determine whether there are statistically significant compositional
differences between groups, there must be some form of biological replication in
the experiment. This is so that we can estimate the variability of the cell type
proportion estimates for each group. A classical statistical test for
differences between two proportions is typically very sensitive to small changes
and will almost always yield a significant p-value. Hence propeller
is only
suitable to use in single cell experiments where there are multiple groups and
multiple biological replicates in at least one of the groups. The absolute
minimum sample size is 2 in one group and 1 in the other group/s. Variance
estimates are obtained from the group with more than 1 biological replicate
which assumes that the cell type proportion variances estimates are similar
between experimental conditions.
The propeller
test is performed after initial analysis of the single cell data
has been done, i.e. after clustering and cell type assignment. The propeller
function can take a SingleCellExperiment
or Seurat
object and extract the
necessary information from the metadata. The basic model for propeller
is that
the cell type proportions for each sample are estimated based on the clustering
information provided by the user or extracted from the relevant slots in the
data objects. The proportions are then transformed using either an arcsin square
root transformation, or logit transformation. For
each cell type \(i\), we fit a linear model with group as the explanatory variable
using functions from the R Bioconductor package limma.
Using limma to obtain p-values has the added benefit
of performing empirical Bayes shrinkage of the variances. For every cell type
we obtain a p-value that indicates whether that cell type proportion is
statistically significantly different between two (or more) groups.
library(speckle)
library(SingleCellExperiment)
library(CellBench)
library(limma)
library(ggplot2)
library(scater)
library(patchwork)
library(edgeR)
library(statmod)
We are using single cell data from the CellBench
package to illustrate how propeller
works. This is an artificial dataset that
is made up of an equal mixture of 3 different cell lines. There are three
datasets corresponding to three different technologies: 10x genomics, CelSeq and
DropSeq.
sc_data <- load_sc_data()
The way that propeller
is designed to be used is in the context of a designed
experiment where there are multiple biological replicates and multiple groups.
Comparing cell type proportions without biological replication should be done
with caution as there will be a large degree of variability in the cell type
proportions between samples due to technical factors (cell capture bias,
sampling, clustering errors), as well as biological variability. The
CellBench dataset does not have biological
replication, so we will create several artificial biological replicates by
bootstrapping the data. Bootstrapping has the advantage that it induces
variability between bootstrap samples by sampling with replacement. Here we will
treat the three technologies as the groups, and create artifical biological
replicates within each group. Note that bootstrapping only induces sampling
variability between our biological replicates, which will almost certainly be
much smaller than biological variability we would expect to see in a real
dataset.
The three single cell experiment objects in sc_data
all have differing numbers
of genes. The first step is to find all the common genes between all three
experiments in order to create one large dataset.
commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in%
rownames(sc_data$sc_celseq)]
commongenes2 <- commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)]
sce_10x <- sc_data$sc_10x[commongenes2,]
sce_celseq <- sc_data$sc_celseq[commongenes2,]
sce_dropseq <- sc_data$sc_dropseq[commongenes2,]
dim(sce_10x)
## [1] 13575 902
dim(sce_celseq)
## [1] 13575 274
dim(sce_dropseq)
## [1] 13575 225
table(rownames(sce_10x) == rownames(sce_celseq))
##
## TRUE
## 13575
table(rownames(sce_10x) == rownames(sce_dropseq))
##
## TRUE
## 13575
This dataset does not have any biological replicates, so we will bootstrap
additional samples and pretend that they are biological replicates.
Bootstrapping won’t replicate true biological variation between samples, but we
will ignore that for the purpose of demonstrating how propeller
works. Note
that we don’t need to simulate gene expression measurements; propeller
only
uses cluster information, hence we simply bootstrap the column indices of the
single cell count matrices.
i.10x <- seq_len(ncol(sce_10x))
i.celseq <- seq_len(ncol(sce_celseq))
i.dropseq <- seq_len(ncol(sce_dropseq))
set.seed(10)
boot.10x <- sample(i.10x, replace=TRUE)
boot.celseq <- sample(i.celseq, replace=TRUE)
boot.dropseq <- sample(i.dropseq, replace=TRUE)
sce_10x_rep2 <- sce_10x[,boot.10x]
sce_celseq_rep2 <- sce_celseq[,boot.celseq]
sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq]
The SingleCellExperiment
objects don’t combine very easily, so I will create a
new object manually, and retain only the information needed to run propeller
.
sample <- rep(c("S1","S2","S3","S4","S5","S6"),
c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq),
ncol(sce_celseq_rep2),
ncol(sce_dropseq),ncol(sce_dropseq_rep2)))
cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line,
sce_celseq_rep2$cell_line,sce_dropseq$cell_line,
sce_dropseq_rep2$cell_line)
group <- rep(c("10x","celseq","dropseq"),
c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq)))
allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2),
counts(sce_celseq), counts(sce_celseq_rep2),
counts(sce_dropseq), counts(sce_dropseq_rep2))
sce_all <- SingleCellExperiment(assays = list(counts = allcounts))
sce_all$sample <- sample
sce_all$group <- group
sce_all$cluster <- cluster
Here I am going to use the Bioconductor package scater to visualise the data. The scater vignette goes quite deeply into quality control of the cells and the kinds of QC plots we like to look at. Here we will simply log-normalise the gene expression counts, perform dimensionality reduction (PCA) and generate PCA/TSNE/UMAP plots to visualise the relationships between the cells.
sce_all <- scater::logNormCounts(sce_all)
sce_all <- scater::runPCA(sce_all)
sce_all <- scater::runUMAP(sce_all)
Plot PC1 vs PC2 colouring by cell line and technology:
pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") +
ggtitle("Cell line")
pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") +
ggtitle("Technology")
pca1 + pca2