Differential state analysis with `muscat`

Helena L Crowell1,2*, Charlotte Soneson1,3, Pierre-Luc Germain1,2 and Mark D Robinson1,2

1Institute for Molecular Life Sciences, University of Zurich, Zurich, Switzerland
2Swiss Institute of Bioinformatics (SIB), Zurich, Switzerland
3Present address: Friedrich Miescher Institute Basel, Switzerland
& Swiss Institute of Bioinformatics (SIB), Basel, Switzerland

*helena.crowell@uzh.ch

March 10, 2022

Abstract

Single-cell RNA sequencing (scRNA-seq) has quickly become an empowering technology to profile the transcriptomes of individual cells on a large scale. Many early analyses of differential expression have aimed at identifying differences between subpopulations, and thus are focused on finding subpopulation markers either in a single sample or across multiple samples. More generally, such methods can compare expression levels in multiple sets of cells, thus leading to cross-condition analyses.

However, given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state (DS) analysis. For example, one could investigate the condition-specific responses of cell subpopulations measured from patients in each condition.

muscat: multi-sample multi-group scRNA-seq analysis tools (Crowell et al. 2019) provides various methods and visualization tools for DS analysis in multi-sample, multi-group, multi-(cell-)subpopulation scRNA-seq data, including cell-level mixed models and methods based on aggregated “pseudobulk” data, as well as a flexible simulation platform that mimics both single and multi-sample scRNA-seq data.

Package

muscat 1.8.2

For details on the concept and technicalities of DS analysis, and the methods presented here, consider having a look at our preprint:

Crowell HL, Soneson C*, Germain P-L*,
Calini D, Collin L, Raposo C, Malhotra D & Robinson MD:
On the discovery of population-specific state transitions from
multi-sample multi-condition single-cell RNA sequencing data.
bioRxiv 713412 (July, 2019). doi: 10.1101/713412

Load packages

library(dplyr)
library(ggplot2)
library(limma)
library(muscat)
library(purrr)

1 Introduction

1.1 What is DS analysis?

A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes (Stegle, Teichmann, and Marioni 2015). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.

Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell 2015). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.

1.2 Starting point

The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).

The approach presented here is modular and thus subpopulation labels could originate from an earlier step in the analysis, such as clustering (Duò, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2019) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019).

2 Getting started

2.1 Data description

For this vignette, we will use a SingleCellExperiment (SCE) containing 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained befor and after 6h-treatment with IFN-\(\beta\) (Kang et al. 2018). The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.

2.2 Loading the data

The Kang et al. (2018) dataset has been made available through Biocondcutor’s ExperimentHub and can be loaded into R as follows: We first initialize a Hub instance to search for and load available data with the ExperimentHub function, and store the complete list of records in the variable eh. Using query, we then retrieve any records that match our keyword(s) of interest, as well as their corresponding accession ID (EH1234).

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")

## ExperimentHub with 3 records
## # snapshotDate(): 2021-10-19
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: character, SingleCellExperiment, BSseq
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1661"]]' 
## 
##            title                                               
##   EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH2259 | Kang18_8vs8

Finally, we load the data of interest into R via [[ and the corresponding accession ID. The dataset contains >35,000 genes and ~29,000 cells:

(sce <- eh[["EH2259"]])

## class: SingleCellExperiment 
## dim: 35635 29065 
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
##   TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):

2.3 Preprocessing

The scater package (McCarthy et al. 2017) provides a variety of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to

remove undetected genes
remove cells with very few or many detected genes
remove very lowly expressed genes
compute normalized expression values for visualization

For more thorough preprocessing, we refer to the Quality control with scater vignette.

# remove undetected genes
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
dim(sce)

## [1] 18890 29065

We use perCellQCMetrics to compute various per-cell quality control metrics, and proceed with filtering cells and genes as noted above:

# calculate per-cell quality control (QC) metrics
library(scater)
qc <- perCellQCMetrics(sce)

# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
dim(sce)

## [1] 18890 26820

# remove lowly expressed genes
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
dim(sce)

## [1]  7118 26820

Finally, we use logNormCounts to calculate log\(_2\)-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming111 Note that, in this workflow, expression values are used for visualization only, and that differential analyses are performed on pseudobulks (section 3.3) or the count data directly (section 3.4)..

# compute sum-factors & normalize
sce <- computeLibraryFactors(sce)
sce <- logNormCounts(sce)

Alternatively, expression values could be obtained via vst (variance stabilizing transformation) from the sctransform package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values:

library(sctransform)
assays(sce)$vstresiduals <- vst(counts(sce), verbosity = FALSE)$y

By default, scater’s functions will try to access the assay data specified via argument exprs_values (default logcounts) for e.g. visualization and dimension reduction. When an alternative assay such as the vstresiduals above should be used, it is thus necessary to explicitly specify this, for example, via runUMAP(sce, exprs_values = "vstresiduals") to compute UMAP cell embedings on the assay data compute above.

2.4 Data preparation

muscat expects a certain format of the input SCE. Specifically, the following cell metadata (colData) columns have to be provided:

"sample_id": unique sample identifiers (e.g., PeterPan_ref1, Nautilus_trt3, …)
"cluster_id": subpopulation (cluster) assignments (e.g., T cells, monocytes, …)
"group_id": experimental group/condition (e.g., control/treatment, healthy/diseased, …)

sce$id <- paste0(sce$stim, sce$ind)
(sce <- prepSCE(sce, 
    kid = "cell", # subpopulation assignments
    gid = "stim",  # group IDs (ctrl/stim)
    sid = "id",   # sample IDs (ctrl/stim.1234)
    drop = TRUE))  # drop all other colData columns

## class: SingleCellExperiment 
## dim: 7118 26820 
## metadata(1): experiment_info
## assays(2): counts logcounts
## rownames(7118): NOC2L HES4 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
##   TTTGCATGTCTTAC-1
## colData names(3): cluster_id sample_id group_id
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):

For consistency and easy accession throughout this vignette, we will store cluster and sample IDs, as well as the number of clusters and samples into the following simple variables:

nk <- length(kids <- levels(sce$cluster_id))
ns <- length(sids <- levels(sce$sample_id))
names(kids) <- kids; names(sids) <- sids

2.5 Data overview

2.5.1 Cluster-sample sizes

As we will be aggregating measurements at the cluster-sample level, it is of particular importance to check the number of cells captured for each such instance. While aggregateData (see Section 3.1) allows excluding cluster-sample combinations with less than a threshold number of cells, clusters or samples with overall very low cell-counts may be excluded from further analysis at this point already.

For the Kang et al. (2018) dataset, for example, one might consider removing the Dendritic cells and Megakaryocytes clusters, as these containg less than 50 cells across all samples.

# nb. of cells per cluster-sample
t(table(sce$cluster_id, sce$sample_id))

##           
##            B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells
##   ctrl101      113             186         336          95               5
##   ctrl1015     476             783         919         226              11
##   ctrl1016     144             419         526         671              10
##   ctrl1039      30             116         202          30               5
##   ctrl107       51             222         197          31               2
##   ctrl1244     134             429        1215          82              28
##   ctrl1256     240             383        1136         156              12
##   ctrl1488     234             317        1343          78              17
##   stim101      144             222         437         121              20
##   stim1015     357             683         814         153              17
##   stim1016     129             361         426         600              10
##   stim1039      39             154         318          40               7
##   stim107       56             185         217          22               7
##   stim1244      94             318         980          46              19
##   stim1256     211             369        1047         133              16
##   stim1488     283             370        1658          73              34
##           
##            FCGR3A+ Monocytes Megakaryocytes NK cells
##   ctrl101                 81             12       84
##   ctrl1015               232             25      208
##   ctrl1016               126             15      151
##   ctrl1039                28              5       20
##   ctrl107                 29              5       49
##   ctrl1244                53             19      131
##   ctrl1256                50             15      275
##   ctrl1488                99             21      120
##   stim101                126              7      120
##   stim1015               222             24      224
##   stim1016               124             12      239
##   stim1039                36             13       32
##   stim107                 36              4       51
##   stim1244                35             14      136
##   stim1256                73             21      257
##   stim1488               139             35      187

2.5.2 Dimension reduction

The dimension reductions (DR) available within the SCE can be accessed via reducedDims from the scater package. The data provided by Kang et al. (2018) already contains t-SNE coordinates; however, we can of course compute additional dimension reductions using one of scater’s runX functions:

# compute UMAP using 1st 20 PCs
sce <- runUMAP(sce, pca = 20)

Using scater’s plotReducedDim function, we can plot t-SNE and UMAP representations colored by cluster and group IDs, respectively. We additionaly create a small wrapper function, .plot_dr(), to improve the readability of color legends and simplify the plotting theme:

# wrapper to prettify reduced dimension plots
.plot_dr <- function(sce, dr, col)
  plotReducedDim(sce, dimred = dr, colour_by = col) +
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 3))) +
    theme_minimal() + theme(aspect.ratio = 1)

For our dataset, the t-SNE and UMAP colored by cluster_ids show that cell-populations are well-separated from one another. IFN-\(\beta\) stimulation manifests as a severe shift in the low-dimensional projection of cells when coloring by group_ids, indicating widespread, genome-scale transcriptiontal changes.

# downsample to max. 100 cells per cluster
cs_by_k <- split(colnames(sce), sce$cluster_id)
cs100 <- unlist(sapply(cs_by_k, function(u) 
  sample(u, min(length(u), 100))))

# plot t-SNE & UMAP colored by cluster & group ID
for (dr in c("TSNE", "UMAP"))
  for (col in c("cluster_id", "group_id"))
    .plot_dr(sce[, cs100], dr, col)

TSNE

Dimension reduction plots. Cells are colored by cluster ID (A) and group ID (B), respectively. For each cluster, at most 100 cells were sampled for plotting.

Figure 1: Dimension reduction plots
Cells are colored by cluster ID (A) and group ID (B), respectively. For each cluster, at most 100 cells were sampled for plotting.

UMAP

Figure 2: Dimension reduction plots
Cells are colored by cluster ID (A) and group ID (B), respectively. For each cluster, at most 100 cells were sampled for plotting.

3 Differential State (DS) analysis

To test for state changes across conditions, we will consider two types of approaches: i) mixed models that act directly on cell-level measurements; and ii) aggregation-based methods that act on pseudobulk data. For both approaches, each gene is tested for state changes in each cluster. Thus, a total of \(\#(genes) \times \#(clusters)\) tests will be performed per comparison of interest. The following schematic summarizes the data representation considered by cell- and sample-level approaches, respectively:

Schematic overview of cell- and sample-level approaches for DS analysis. Top panels show a schematic of the data distributions or aggregates across samples (each violin is a group or sample; each dot is a sample) and conditions (blue or orange). The bottom panels highlight the data organization in sub-matrix slices of the original count table.

Figure 3: Schematic overview of cell- and sample-level approaches for DS analysis
Top panels show a schematic of the data distributions or aggregates across samples (each violin is a group or sample; each dot is a sample) and conditions (blue or orange). The bottom panels highlight the data organization in sub-matrix slices of the original count table.

3.1 Aggregation of single-cell to pseudobulk data

In order to leverage existing robust bulk RNA-seq DE frameworks, such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015), we first aggregate measurements for each sample (in each cluster) to obtain pseudobulk data.

In general, aggregateData() will aggregate the data by the colData variables specified with argument by, and return a SingleCellExperiment containing pseudobulk data.
For DS analysis, measurements must be aggregated at the cluster-sample level (default by = c("cluster_id", "sample_id"). In this case, the returned SingleCellExperiment will contain one assay per cluster, where rows = genes and columns = samples. Arguments assay and fun specify the input data and summary statistic, respectively, to use for aggregation.
While, in principle, various combinations of input data (raw/(log-)normalized counts, CPM ect.) and summary statistics (sum, mean, median) could be applied, we here default to the sum of raw counts:

pb <- aggregateData(sce,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
# one sheet per subpopulation
assayNames(pb)

## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"

# pseudobulks for 1st subpopulation
t(head(assay(pb)))

##          NOC2L HES4 ISG15 TNFRSF18 TNFRSF4 SDF4
## ctrl101     12    0    13        8       1   13
## ctrl1015    37    4   136       55      16   42
## ctrl1016    13    3    42       10       0   15
## ctrl1039     2    1    16        3       1    1
## ctrl107      7    0     6        3       4    5
## ctrl1244    24    7    25       30      11    7
## ctrl1256    22    1    65       30      11   17
## ctrl1488    19    1    34       27       6   23
## stim101     12    8  1362        3       1    9
## stim1015    34   27  4022       21       2   19
## stim1016     7    8  1271        2       4    6
## stim1039     3    3   393        1       1    2
## stim107      4    2   556        1       0    3
## stim1244    13   10   830        8       2    5
## stim1256    14   10  2235       10       2    6
## stim1488    18    5  2927       13       1   19

3.2 Pseudobulk-level MDS plot

Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities.

pbMDS takes as input any SCE containg PB data as returned by aggregateData, and computes MDS dimensions using edgeR. Ideally, such a representation of the data should separate both clusters and groups from one another. Vice versa, samples from the same cluster or group should cluster together.

In our MDS plot on pseudo-bulk counts (Fig. 4), we can observe that the first dimension (MDS1) clearly separates cell populations (clusters), while the second (MDS2) separates control and stimulated samples (groups). Furthermore, the two T-cell clusters fall close to each other.

(pb_mds <- pbMDS(pb))

Pseudobulk-level multidimensional scaling (MDS) plot. Each point represents a cluster-sample instance; points are colored by cluster ID and shaped by group ID.

Figure 4: Pseudobulk-level multidimensional scaling (MDS) plot
Each point represents a cluster-sample instance; points are colored by cluster ID and shaped by group ID.

If you’re not satisfied with how the plot looks, here’s an example of how to modify the ggplot-object from above in various ways:

# use very distinctive shaping of groups & change cluster colors
pb_mds <- pb_mds + 
  scale_shape_manual(values = c(17, 4)) +
  scale_color_manual(values = RColorBrewer::brewer.pal(8, "Set2"))
# change point size & alpha
pb_mds$layers[[1]]$aes_params$size <- 5
pb_mds$layers[[1]]$aes_params$alpha <- 0.6
pb_mds

Pseudobulk-level MDS plot v2. Default plotting aesthetics were modified to change shaping of groups, coloring of clusters, as well as point size and transparency.

Figure 5: Pseudobulk-level MDS plot v2
Default plotting aesthetics were modified to change shaping of groups, coloring of clusters, as well as point size and transparency.

3.3 Sample-level analysis: Pseudobulk methods

Once we have assembled the pseudobulk data, we can test for DS using pbDS. By default, a \(\sim group\_id\) model is fit, and the last coefficient of the linear model is tested to be equal to zero.

# run DS analysis
res <- pbDS(pb, verbose = FALSE)
# access results table for 1st comparison
tbl <- res$table[[1]]
# one data.frame per cluster
names(tbl)

## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"

# view results for 1st cluster
k1 <- tbl[[1]]
head(format(k1[, -ncol(k1)], digits = 2))

##       gene cluster_id    logFC logCPM       F   p_val p_adj.loc p_adj.glb
## 1    NOC2L    B cells -0.27278    6.5 1.7e+00 2.2e-01   3.8e-01   3.5e-01
## 2    ISG15    B cells  5.47348   12.2 9.1e+02 2.4e-21   9.3e-19   3.2e-18
## 3 TNFRSF18    B cells -1.34598    6.4 2.6e+01 2.7e-05   1.9e-04   1.7e-04
## 4   CPSF3L    B cells -0.13505    6.1 3.6e-01 5.9e-01   7.4e-01   7.2e-01
## 5 AURKAIP1    B cells -0.01435    7.8 1.6e-02 9.5e-01   9.8e-01   9.8e-01
## 6   MRPL20    B cells  0.23892    7.4 2.7e+00 1.1e-01   2.3e-01   2.1e-01

Depening on the complexity of the experimental design (e.g., when there are more than two groups present), comparison(s) of interest may need to be specified explicitly. We can provide pbDS with a design matrix capturing the experimental design using model.matrix (package stats), and a contrast matrix that specifies our comparison of interesting using makeContrasts from the limma package. Alternatively, the comparison(s) of interest (or a list thereof) can be specified with via coefs (see ?glmQLFTest for details). For the Kang et al. (2018) dataset, we want to carry out a single comparison of stimulated against control samples, thus placing "ctrl" on the right-hand side as the reference condition:

# construct design & contrast matrix
ei <- metadata(sce)$experiment_info
mm <- model.matrix(~ 0 + ei$group_id)
dimnames(mm) <- list(ei$sample_id, levels(ei$group_id))
contrast <- makeContrasts("stim-ctrl", levels = mm)

# run DS analysis
pbDS(pb, design = mm, contrast = contrast)

3.4 Cell-level analysis: Mixed models

Alternative to the above sample-level approach, we fit (for each gene) a mixed model (MM) to the cell-level measurement data. muscat provides implementations of MM that use 3 main approaches:

fitting linear mixed models (LMMs) on log-normalized data with observational weights,
fitting LMMs on variance-stabilized data; and,
fitting generalized linear mixed models (GLMMs) directly on counts

In each case, a \(\sim 1 + \text{group_id} + (1\,|\,\text{sample_id})\) model is fit for each gene, optimizing the log-likelihood (i.e., REML = FALSE). P-values are calculated using the estimates of degrees of freedom specifying by argument df (default "Satterthwaite"). Fitting, testing and moderation are applied subpopulation-wise. For differential testing, mmDS will only consider:

subpopulations with at least n_cells cells (default 10) in at least n_samples samples (default 2)
genes with a count >= min_count (default 1) in at least min_cells (default 20)

Mixed model based approaches can be run directly on cell-level measurements, and do not require prior aggregation:

# 1st approach
mm <- mmDS(sce, method = "dream",
  n_cells = 10, n_samples = 2,
  min_counts = 1, min_cells = 20)

# 2nd & 3rd approach
mm <- mmDS(sce, method = "vst", vst = "sctransform")
mm <- mmDS(sce, method = "nbinom")

4 Handling results

4.1 Results filtering & overview

To get a general overview of the differential testing results, we first filter them to retain hits FDR < 5% and abs(logFC) > 1, and count the number and frequency of differential findings by cluster. Finally, we can view the top hits (lowest adj. p-value) in each cluster.

# filter FDR < 5%, abs(logFC) > 1 & sort by adj. p-value
tbl_fil <- lapply(tbl, function(u) {
  u <- dplyr::filter(u, p_adj.loc < 0.05, abs(logFC) > 1)
  dplyr::arrange(u, p_adj.loc)
})

# nb. of DS genes & % of total by cluster
n_de <- vapply(tbl_fil, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DS" = n_de, "%DS" = p_de, check.names = FALSE)

##                    #DS    %DS
## B cells            238  3.344
## CD14+ Monocytes   1042 14.639
## CD4 T cells        322  4.524
## CD8 T cells         94  1.321
## Dendritic cells    117  1.644
## FCGR3A+ Monocytes  384  5.395
## Megakaryocytes      28  0.393
## NK cells           168  2.360

# view top 2 hits in each cluster
top2 <- bind_rows(lapply(tbl_fil, top_n, 2, p_adj.loc))
format(top2[, -ncol(top2)], digits = 2)

##        gene        cluster_id logFC logCPM    F   p_val p_adj.loc p_adj.glb
## 1   DNAJC15           B cells   1.1    7.1 12.4 0.00168    0.0074    0.0065
## 2      MCM5           B cells  -1.1    6.4  9.9 0.00413    0.0162    0.0141
## 3    CCL3L1   CD14+ Monocytes   1.1    5.8  8.7 0.00750    0.0138    0.0233
## 4  SERPINB2   CD14+ Monocytes  -1.1    7.2  8.4 0.00855    0.0155    0.0260
## 5      HOPX       CD4 T cells  -1.0    4.1  8.0 0.00940    0.0442    0.0282
## 6     CPNE2       CD4 T cells  -1.1    4.1  7.9 0.00957    0.0447    0.0287
## 7    TUBA4A       CD8 T cells  -1.1    7.3 19.7 0.00017    0.0015    0.0009
## 8     ITM2C       CD8 T cells  -1.0    7.3 14.8 0.00078    0.0057    0.0034
## 9      RALA   Dendritic cells  -1.0    9.1  8.1 0.01227    0.0352    0.0351
## 10    H2AFZ   Dendritic cells  -1.0   10.2  7.2 0.01675    0.0454    0.0457
## 11     C1QB FCGR3A+ Monocytes   1.6    6.5  8.8 0.00730    0.0165    0.0228
## 12    CBWD1 FCGR3A+ Monocytes   1.0    6.6  6.9 0.01571    0.0316    0.0434
## 13  TMEM123    Megakaryocytes   1.1    9.7 18.3 0.00065    0.0065    0.0029
## 14   NT5C3A    Megakaryocytes   1.3    9.7 13.8 0.00208    0.0194    0.0079
## 15    FOXP1          NK cells  -1.1    6.7 11.4 0.00231    0.0120    0.0086
## 16    ITM2C          NK cells  -1.1    7.4 10.4 0.00344    0.0165    0.0120

4.2 Calculating expression frequencies

Besides filter DS results based on magnitude (logFCs) and significance (FDR), it is often worthwhile to also consider the expression frequencies of each gene, i.e., the fraction of cells that express a given gene in each sample and/or group.
muscat provides wrapper, calcExprFreqs to compute cluster-sample/-group wise expression frequencies. Here, a gene is considered to be expressed when the specified measurement value (argument assay) falls above a certain threshold (argument th). Note that, assay = "counts" and th = 0 (default) amounts to the fraction of cells for which a respective gene has been detected.
calcExprFreqs will return a SingleCellExperiment object, where sheets (assays) = clusters, rows = genes, and columns = samples (and groups, if group_ids are present in the colData of the input SCE).

frq <- calcExprFreqs(sce, assay = "counts", th = 0)
# one sheet per cluster
assayNames(frq)

## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"

# expression frequencies in each
# sample & group; 1st cluster
t(head(assay(frq), 5))

##               NOC2L        HES4      ISG15    TNFRSF18     TNFRSF4
## ctrl101  0.09734513 0.000000000 0.08849558 0.061946903 0.008849558
## ctrl1015 0.07773109 0.008403361 0.18067227 0.090336134 0.021008403
## ctrl1016 0.08333333 0.013888889 0.20833333 0.048611111 0.000000000
## ctrl1039 0.06666667 0.033333333 0.33333333 0.100000000 0.033333333
## ctrl107  0.09803922 0.000000000 0.07843137 0.058823529 0.019607843
## ctrl1244 0.12686567 0.037313433 0.10447761 0.126865672 0.029850746
## ctrl1256 0.08750000 0.004166667 0.18750000 0.091666667 0.025000000
## ctrl1488 0.06837607 0.004273504 0.11965812 0.076923077 0.012820513
## stim101  0.08333333 0.055555556 0.98611111 0.020833333 0.006944444
## stim1015 0.08963585 0.072829132 0.99719888 0.044817927 0.005602241
## stim1016 0.03875969 0.054263566 1.00000000 0.007751938 0.023255814
## stim1039 0.05128205 0.051282051 1.00000000 0.025641026 0.025641026
## stim107  0.05357143 0.035714286 1.00000000 0.017857143 0.000000000
## stim1244 0.11702128 0.095744681 0.98936170 0.053191489 0.021276596
## stim1256 0.06635071 0.037914692 0.99526066 0.037914692 0.004739336
## stim1488 0.05653710 0.014134276 0.99646643 0.031802120 0.003533569
## ctrl     0.08509142 0.009845288 0.15963432 0.084388186 0.018284107
## stim     0.07235339 0.050266565 0.99543031 0.033511043 0.008377761

We can use the obtained frequencies to, for instance, only retain genes that are expressed in an average of 10% of cells in at least 1 group:

gids <- levels(sce$group_id)
frq10 <- vapply(as.list(assays(frq)), 
  function(u) apply(u[, gids] > 0.1, 1, any), 
  logical(nrow(sce)))
t(head(frq10))

##                   NOC2L  HES4 ISG15 TNFRSF18 TNFRSF4  SDF4
## B cells           FALSE FALSE  TRUE    FALSE   FALSE FALSE
## CD14+ Monocytes   FALSE  TRUE  TRUE    FALSE   FALSE  TRUE
## CD4 T cells       FALSE FALSE  TRUE    FALSE   FALSE FALSE
## CD8 T cells       FALSE FALSE  TRUE    FALSE   FALSE FALSE
## Dendritic cells   FALSE  TRUE  TRUE    FALSE    TRUE  TRUE
## FCGR3A+ Monocytes FALSE  TRUE  TRUE    FALSE   FALSE FALSE
## Megakaryocytes    FALSE FALSE  TRUE    FALSE   FALSE FALSE
## NK cells          FALSE FALSE  TRUE     TRUE   FALSE  TRUE

tbl_fil2 <- lapply(kids, function(k)
  dplyr::filter(tbl_fil[[k]], 
    gene %in% names(which(frq10[, k]))))

# nb. of DS genes & % of total by cluster
n_de <- vapply(tbl_fil2, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DS" = n_de, "%DS" = p_de, check.names = FALSE)

##                   #DS   %DS
## B cells           226 3.175
## CD14+ Monocytes   622 8.738
## CD4 T cells       154 2.164
## CD8 T cells        94 1.321
## Dendritic cells   117 1.644
## FCGR3A+ Monocytes 382 5.367
## Megakaryocytes     28 0.393
## NK cells          160 2.248

4.3 Formatting results

Especially when testing multiple contrasts or coefficients, the results returned by runDS may become very complex and unhandy for exploration or exporting. Results can be formatted using resDS, which provides two alternative modes for formatting: bind = "row"/"col".

When bind = "row", results from all comparisons will be merged vertically (analogous to do.call("rbind", ...)) into a tidy format table, with column contrast/coef specifying the comparison.

Otherwise, bind = "col", results will be merged horizontally into a single wide table where all results for a given gene and cluster are kept in one row. An identifier of the respective contrast of coefficient is then appended to the column names. This format is useful when wanting to view a specific gene’s behavior across, for example, multiple treatments, but will become messy when many comparisons are included.

Expression frequencies computed with calcExprFreqs, as well as cluster-sample level avg. CPM, can be included in the results by setting frq/cpm = TRUE. Alternatively, if the former have been pre-computed, they can be supplied directly as an input to resDS (see example below).

# tidy format; attach pre-computed expression frequencies
resDS(sce, res, bind = "row", frq = frq)

# big-table (wide) format; attach CPMs
resDS(sce, res, bind = "col", cpm = TRUE)

Alternatively, if expression frequencies have not been pre-computed with calcExprFreqs, they may be added to the results table directly by specifying frq = TRUE:

# compute expression frequencies on the fly
resDS(sce, res, frq = TRUE)

5 Visualizing results

5.1 Between-cluster concordance

DS analysis aims at identifying population-specific changes in state (or expression) across conditions. In this setting, key questions of interest arise, e.g., which genes are DE in only a single (or very few) clusters? How many DE genes are shared between clusters? In summary, what is the general concordance in differential findings between clusters?

To gain an impression of the between-cluster (dis-)agreement on DE genes, we generate an UpSet-plot that visualizes the number of DE genes that are shared across or unique to certain clusters:

library(UpSetR)
de_gs_by_k <- map(tbl_fil, "gene")
upset(fromList(de_gs_by_k))

Figure 6: Upset plot
Included are DS findings (FDR < 0.05, |logFC| > 1) across all clusters; shown are the 50 most frequent interactions.

An UpSet plot as the one above tells us, for instance, that 185 genes are differential for all subpopulations; 387 across both Monocytes clusters; and 159 only in the B cells cluster.

5.2 DR colored by expression

The code chunk generates a set of t-SNEs colored by gene expression for the top-8 DS genes. To match the affected cells to their cluster and experimental group, see the t-SNEs colored by cluster and group ID from above.

# pull top-8 DS genes across all clusters
top8 <- bind_rows(tbl_fil) %>% 
  top_n(8, dplyr::desc(p_adj.loc)) %>% 
  pull("gene")

# for ea. gene in 'top8', plot t-SNE colored by its expression 
ps <- lapply(top8, function(g)
  .plot_dr(sce[, cs100], "TSNE", g) + 
    ggtitle(g) + theme(legend.position = "none"))

# arrange plots
plot_grid(plotlist = ps, ncol = 4, align = "vh")

t-SNE colored by gene expression. Show are t-SNE projections with cells colored by the expression of the top-8 DS genes. For each cluster, at most 100 cells were sampled for plotting.

Figure 7: t-SNE colored by gene expression
Show are t-SNE projections with cells colored by the expression of the top-8 DS genes. For each cluster, at most 100 cells were sampled for plotting.

5.3 Cell-level viz.: Violin plots

For changes of high interest, we can view the cell-level expression profiles of a specific gene across samples or groups using plotExpression (scater package). Here, we generate violin plots for the top-6 DS genes (lowest adj. p-value) in the B cells cluster222 Note that, as DS testing is done at the cluster-level, we need to subset the cells that have been assigned to the corresponding cluster for plotting..

plotExpression(sce[, sce$cluster_id == "B cells"],
  features = tbl_fil$`B cells`$gene[seq_len(6)],
  x = "sample_id", colour_by = "group_id", ncol = 3) +
  guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 8: Violin plots
Show are the top 6 hits (lowest adj. p-value) for the B cells cluster. Each violin is a sample; points are colored by group ID.

5.4 Sample-level viz.: Pseudobulk heatmaps

Especially when wanting to gain an overview of numerous DE testing results for many clusters, both dimension reduction and cell-level visualisations require a lot of space can become cumbersome to interpret. In this setting, it is thus recommended to visualise aggregated measures, e.g., mean expressions by cluster sample.

# top-5 DS genes per cluster
pbHeatmap(sce, res, top_n = 5)

## Warning: `filter_()` was deprecated in dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Alternatively, pbHeatmap provides a set of options regarding which cluster(s), gene(s), and comparison to include (arguments k, g and c, respectively). For example, the following options render a heatmap visualizing the top 20 DS genes for the B cells cluster:

# top-20 DS genes for single cluster
pbHeatmap(sce, res, k = "B cells")

Similarly, we can visualize the cluster-sample means of a single gene of interest across all clusters in order to identify cell-types that are affected similarly by different experimental conditions:

# single gene across all clusters
pbHeatmap(sce, res, g = "ISG20")

Session info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] UpSetR_1.4.0                scater_1.22.0              
##  [3] scuttle_1.4.0               muscData_1.8.0             
##  [5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
##  [7] Biobase_2.54.0              GenomicRanges_1.46.1       
##  [9] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [11] S4Vectors_0.32.3            MatrixGenerics_1.6.0       
## [13] matrixStats_0.61.0          ExperimentHub_2.2.1        
## [15] AnnotationHub_3.2.2         BiocFileCache_2.2.1        
## [17] dbplyr_2.1.1                BiocGenerics_0.40.0        
## [19] purrr_0.3.4                 muscat_1.8.2               
## [21] limma_3.50.1                ggplot2_3.3.5              
## [23] dplyr_1.0.8                 cowplot_1.1.1              
## [25] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                    tidyselect_1.1.2             
##   [3] lme4_1.1-28                   RSQLite_2.2.10               
##   [5] AnnotationDbi_1.56.2          grid_4.1.2                   
##   [7] BiocParallel_1.28.3           munsell_0.5.0                
##   [9] ScaledMatrix_1.2.0            codetools_0.2-18             
##  [11] future_1.24.0                 withr_2.5.0                  
##  [13] colorspace_2.0-3              filelock_1.0.2               
##  [15] highr_0.9                     knitr_1.37                   
##  [17] listenv_0.8.0                 labeling_0.4.2               
##  [19] Rdpack_2.1.4                  emmeans_1.7.2                
##  [21] GenomeInfoDbData_1.2.7        farver_2.1.0                 
##  [23] bit64_4.0.5                   glmmTMB_1.1.2.3              
##  [25] coda_0.19-4                   parallelly_1.30.0            
##  [27] vctrs_0.3.8                   generics_0.1.2               
##  [29] TH.data_1.1-0                 xfun_0.30                    
##  [31] R6_2.5.1                      doParallel_1.0.17            
##  [33] ggbeeswarm_0.6.0              clue_0.3-60                  
##  [35] rsvd_1.0.5                    locfit_1.5-9.5               
##  [37] bitops_1.0-7                  cachem_1.0.6                 
##  [39] DelayedArray_0.20.0           assertthat_0.2.1             
##  [41] promises_1.2.0.1              scales_1.1.1                 
##  [43] multcomp_1.4-18               beeswarm_0.4.0               
##  [45] gtable_0.3.0                  beachmat_2.10.0              
##  [47] globals_0.14.0                sandwich_3.0-1               
##  [49] rlang_1.0.2                   genefilter_1.76.0            
##  [51] GlobalOptions_0.1.2           splines_4.1.2                
##  [53] TMB_1.8.0                     broom_0.7.12                 
##  [55] BiocManager_1.30.16           yaml_2.3.5                   
##  [57] reshape2_1.4.4                backports_1.4.1              
##  [59] httpuv_1.6.5                  tools_4.1.2                  
##  [61] bookdown_0.24                 ellipsis_0.3.2               
##  [63] gplots_3.1.1                  jquerylib_0.1.4              
##  [65] RColorBrewer_1.1-2            Rcpp_1.0.8                   
##  [67] plyr_1.8.6                    sparseMatrixStats_1.6.0      
##  [69] progress_1.2.2                zlibbioc_1.40.0              
##  [71] RCurl_1.98-1.6                prettyunits_1.1.1            
##  [73] GetoptLong_1.0.5              viridis_0.6.2                
##  [75] zoo_1.8-9                     ggrepel_0.9.1                
##  [77] cluster_2.1.2                 variancePartition_1.24.0     
##  [79] magrittr_2.0.2                magick_2.7.3                 
##  [81] RSpectra_0.16-0               data.table_1.14.2            
##  [83] lmerTest_3.1-3                circlize_0.4.14              
##  [85] mvtnorm_1.1-3                 mime_0.12                    
##  [87] hms_1.1.1                     evaluate_0.15                
##  [89] xtable_1.8-4                  pbkrtest_0.5.1               
##  [91] RhpcBLASctl_0.21-247.1        XML_3.99-0.9                 
##  [93] gridExtra_2.3                 shape_1.4.6                  
##  [95] compiler_4.1.2                tibble_3.1.6                 
##  [97] KernSmooth_2.23-20            crayon_1.5.0                 
##  [99] minqa_1.2.4                   htmltools_0.5.2              
## [101] later_1.3.0                   tidyr_1.2.0                  
## [103] geneplotter_1.72.0            DBI_1.1.2                    
## [105] ComplexHeatmap_2.10.0         rappdirs_0.3.3               
## [107] MASS_7.3-55                   boot_1.3-28                  
## [109] Matrix_1.4-0                  cli_3.2.0                    
## [111] rbibutils_2.2.7               parallel_4.1.2               
## [113] pkgconfig_2.0.3               numDeriv_2016.8-1.1          
## [115] foreach_1.5.2                 annotate_1.72.0              
## [117] vipor_0.4.5                   bslib_0.3.1                  
## [119] blme_1.0-5                    XVector_0.34.0               
## [121] estimability_1.3              stringr_1.4.0                
## [123] digest_0.6.29                 RcppAnnoy_0.0.19             
## [125] sctransform_0.3.3             Biostrings_2.62.0            
## [127] rmarkdown_2.12                uwot_0.1.11                  
## [129] edgeR_3.36.0                  DelayedMatrixStats_1.16.0    
## [131] curl_4.3.2                    shiny_1.7.1                  
## [133] gtools_3.9.2                  rjson_0.2.21                 
## [135] nloptr_2.0.0                  lifecycle_1.0.1              
## [137] nlme_3.1-155                  jsonlite_1.8.0               
## [139] aod_1.3.1                     BiocNeighbors_1.12.0         
## [141] viridisLite_0.4.0             fansi_1.0.2                  
## [143] pillar_1.7.0                  lattice_0.20-45              
## [145] KEGGREST_1.34.0               fastmap_1.1.0                
## [147] httr_1.4.2                    survival_3.3-1               
## [149] interactiveDisplayBase_1.32.0 glue_1.6.2                   
## [151] png_0.1-7                     iterators_1.0.14             
## [153] BiocVersion_3.14.0            bit_4.0.4                    
## [155] stringi_1.7.6                 sass_0.4.0                   
## [157] blob_1.2.2                    DESeq2_1.34.0                
## [159] BiocSingular_1.10.0           caTools_1.18.2               
## [161] memoise_2.0.1                 irlba_2.3.5                  
## [163] future.apply_1.8.1

References

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20.

Crowell, Helena L, Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D Robinson. 2019. “On the Discovery of Population-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell RNA Sequencing Data.” bioRxiv 713412.

Diaz-Mejia, J Javier, J Javier Diaz-Mejia, Elaine C Meng, Alexander R Pico, Sonya A MacParland, Troy Ketela, Trevor J Pugh, Gary D Bader, and John H Morris. 2019. “Evaluation of Methods to Assign Cell Type Labels to Cell Clusters from Single-Cell RNA-sequencing Data.” F1000Research 8: 296.

Duò, Angelo, Mark D Robinson, and Charlotte Soneson. 2018. “A Systematic Performance Evaluation of Clustering Methods for Single-Cell RNA-seq Data.” F1000Research 7: 1141.

Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell RNA-sequencing Data.” F1000Research 7: 1297.

Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-seq Data Using Regularized Negative Binomial Regression.” bioRxiv 576827.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biology 15 (12): 550.

McCarthy, Davis J, Kieran R Campbell, Aaron T L Lun, and Quin F Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-seq Data in R.” Bioinformatics 33 (8): 1179–86.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Soneson, Charlotte, and Mark D Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods 15 (4): 255–61.

Stegle, Oliver, Sarah A Teichmann, and John C Marioni. 2015. “Computational and Analytical Challenges in Single-Cell Transcriptomics.” Nature Reviews Genetics 16 (3): 133–45.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck 3rd, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.e21.

Trapnell, Cole. 2015. “Defining Cell Types and States with Single-Cell Genomics.” Genome Research 25 (10): 1491–8.

Wagner, Allon, Aviv Regev, and Nir Yosef. 2016. “Revealing the Vectors of Cellular Identity with Single-Cell Genomics.” Nature Biotechnology 34 (11): 1145–60.

Zhang, Allen W, Ciara O, Elizabeth Chavez, Jamie L P Lim, Andrew McPherson, Matt Wiens, Pascale Walters, et al. 2019. “Probabilistic Cell Type Assignment of Single-Cell Transcriptomic Data Reveals Spatiotemporal Microenvironment Dynamics in Human Cancers.” bioRxiv 521914.

Differential state analysis with muscat