Contents

1 Introduction

The speckle package contains functions to analyse differences in cell type proportions in single cell RNA-seq data. As our research into specialised analyses of single cell data continues we anticipate that the package will be updated with new functions.

The propeller method has now been published in Bioinformatics:
Belinda Phipson, Choon Boon Sim, Enzo R Porrello, Alex W Hewitt, Joseph Powell, Alicia Oshlack, propeller: testing for differences in cell type proportions in single cell data, Bioinformatics, 2022;, btac582, https://doi.org/10.1093/bioinformatics/btac582

The analysis of single cell RNA-seq data consists of a large number of steps, which can be iterative and also depend on the research question. There are many R packages that can do some or most of these steps. The analysis steps are described here briefly.

Once the sequencing data has been summarised into counts over genes, quality control is performed to remove poor quality cells. Poor quality cells are often characterised as having very low total counts (library size) and very few genes detected. Lowly expressed and uninformative genes are filtered out, followed by appropriate normalisation. Dimensionality reduction and clustering of the cells is then performed. Cells that have similar transcriptional profiles cluster together, and these clusters (hopefully) correspond to something biologically relevant, such as different cell types. Differential expression between each cluster compared to all other clusters can highlight genes that are more highly expressed in each cluster. These marker genes help to determine the cell type each cluster corresponds to. Cell type identification is a process that often uses marker genes as well as a list of curated genes that are known to be expressed in each cell type. It is always helpful to visualise the data in a lot of different ways to aid in interpretation of the clusters using tSNE/UMAP plots, clustering trees and heatmaps of known marker genes. An alternative to clustering is classification or label transfer approaches, where reference datasets can be used to annotate new datasets.

2 Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("speckle")

3 Finding significant differences in cell type proportions using propeller

In order to determine whether there are statistically significant compositional differences between groups, there must be some form of biological replication in the experiment. This is so that we can estimate the variability of the cell type proportion estimates for each group. A classical statistical test for differences between two proportions is typically very sensitive to small changes and will almost always yield a significant p-value. Hence propeller is only suitable to use in single cell experiments where there are multiple groups and multiple biological replicates in at least one of the groups. The absolute minimum sample size is 2 in one group and 1 in the other group/s. Variance estimates are obtained from the group with more than 1 biological replicate which assumes that the cell type proportion variances estimates are similar between experimental conditions.

The propeller test is performed after initial analysis of the single cell data has been done, i.e. after clustering and cell type assignment. The propeller function can take a SingleCellExperiment or Seurat object and extract the necessary information from the metadata. The basic model for propeller is that the cell type proportions for each sample are estimated based on the clustering information provided by the user or extracted from the relevant slots in the data objects. The proportions are then transformed using either an arcsin square root transformation, or logit transformation. For each cell type \(i\), we fit a linear model with group as the explanatory variable using functions from the R Bioconductor package limma. Using limma to obtain p-values has the added benefit of performing empirical Bayes shrinkage of the variances. For every cell type we obtain a p-value that indicates whether that cell type proportion is statistically significantly different between two (or more) groups.

4 Load the libraries

library(speckle)
library(SingleCellExperiment)
library(CellBench)
library(limma)
library(ggplot2)
library(scater)
library(patchwork)
library(edgeR)
library(statmod)

5 Loading data into R

We are using single cell data from the CellBench package to illustrate how propeller works. This is an artificial dataset that is made up of an equal mixture of 3 different cell lines. There are three datasets corresponding to three different technologies: 10x genomics, CelSeq and DropSeq.

sc_data <- load_sc_data()

The way that propeller is designed to be used is in the context of a designed experiment where there are multiple biological replicates and multiple groups. Comparing cell type proportions without biological replication should be done with caution as there will be a large degree of variability in the cell type proportions between samples due to technical factors (cell capture bias, sampling, clustering errors), as well as biological variability. The CellBench dataset does not have biological replication, so we will create several artificial biological replicates by bootstrapping the data. Bootstrapping has the advantage that it induces variability between bootstrap samples by sampling with replacement. Here we will treat the three technologies as the groups, and create artifical biological replicates within each group. Note that bootstrapping only induces sampling variability between our biological replicates, which will almost certainly be much smaller than biological variability we would expect to see in a real dataset.

The three single cell experiment objects in sc_data all have differing numbers of genes. The first step is to find all the common genes between all three experiments in order to create one large dataset.

commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in% 
                                                rownames(sc_data$sc_celseq)]
commongenes2 <-  commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)]

sce_10x <- sc_data$sc_10x[commongenes2,]
sce_celseq <- sc_data$sc_celseq[commongenes2,] 
sce_dropseq <- sc_data$sc_dropseq[commongenes2,] 

dim(sce_10x)
## [1] 13575   902
dim(sce_celseq)
## [1] 13575   274
dim(sce_dropseq)
## [1] 13575   225
table(rownames(sce_10x) == rownames(sce_celseq))
## 
##  TRUE 
## 13575
table(rownames(sce_10x) == rownames(sce_dropseq))
## 
##  TRUE 
## 13575

6 Bootstrap additional samples

This dataset does not have any biological replicates, so we will bootstrap additional samples and pretend that they are biological replicates. Bootstrapping won’t replicate true biological variation between samples, but we will ignore that for the purpose of demonstrating how propeller works. Note that we don’t need to simulate gene expression measurements; propeller only uses cluster information, hence we simply bootstrap the column indices of the single cell count matrices.

i.10x <- seq_len(ncol(sce_10x))
i.celseq <- seq_len(ncol(sce_celseq))
i.dropseq <- seq_len(ncol(sce_dropseq))

set.seed(10)
boot.10x <- sample(i.10x, replace=TRUE)
boot.celseq <- sample(i.celseq, replace=TRUE)
boot.dropseq <- sample(i.dropseq, replace=TRUE)

sce_10x_rep2 <- sce_10x[,boot.10x]
sce_celseq_rep2 <- sce_celseq[,boot.celseq]
sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq]

7 Combine all SingleCellExperiment objects

The SingleCellExperiment objects don’t combine very easily, so I will create a new object manually, and retain only the information needed to run propeller.

sample <- rep(c("S1","S2","S3","S4","S5","S6"), 
                c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq),
                ncol(sce_celseq_rep2), 
                ncol(sce_dropseq),ncol(sce_dropseq_rep2)))
cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line,
                sce_celseq_rep2$cell_line,sce_dropseq$cell_line,
                sce_dropseq_rep2$cell_line)
group <- rep(c("10x","celseq","dropseq"),
                c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq)))

allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2), 
                    counts(sce_celseq), counts(sce_celseq_rep2),
                    counts(sce_dropseq), counts(sce_dropseq_rep2))

sce_all <- SingleCellExperiment(assays = list(counts = allcounts))
sce_all$sample <- sample
sce_all$group <- group
sce_all$cluster <- cluster

8 Visualise the data

Here I am going to use the Bioconductor package scater to visualise the data. The scater vignette goes quite deeply into quality control of the cells and the kinds of QC plots we like to look at. Here we will simply log-normalise the gene expression counts, perform dimensionality reduction (PCA) and generate PCA/TSNE/UMAP plots to visualise the relationships between the cells.

sce_all <- scater::logNormCounts(sce_all)
sce_all <- scater::runPCA(sce_all)
sce_all <- scater::runUMAP(sce_all)

Plot PC1 vs PC2 colouring by cell line and technology:

pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") +
    ggtitle("Cell line")
pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") +
    ggtitle("Technology")
pca1 + pca2