1 Loading Processed Single-Cell Data

For the demonstration of escape, we will use the example “pbmc_small” data from Seurat and also generate a SingleCellExperiment object from it.

suppressPackageStartupMessages(library(escape))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(scran))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(SeuratObject))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(ggplot2))


pbmc_small <- get("pbmc_small")
sce.pbmc <- as.SingleCellExperiment(pbmc_small, assay = "RNA")

2 Getting Gene Sets

2.1 Option 1: Molecular Signature Database

The first step in the process of performing gene set enrichment analysis is identifying the gene sets we would like to use. The function getGeneSets() allows users to isolate a whole or multiple libraries from a list of GSEABase GeneSetCollection objects.

We can do this for gene set collections from the built-in Molecular Signature Database by setting the parameter library equal to library/libraries of interest. For multiple libraries, just set library = c(“Library1”, “Library2”, etc).

Additional parameters include:

  • subcategory Can be used to select subcategories in the larger database.
  • gene.sets Can select individual pathways/gene sets to be isolated.

If the sequencing of the single-cell data is performed on a species other than “Homo sapiens”, make sure to use the species parameter in getGeneSets() in order to get the correct gene nomenclature.

GS.hallmark <- getGeneSets(library = "H")

2.2 Option 2: Built-In gene sets

data("escape.gene.sets", package="escape")
gene.sets <- escape.gene.sets

2.3 Option 3: Define personal gene sets

gene.sets <- list(Bcells = c("MS4A1","CD79B","CD79A","IGH1","IGH2")
                        Myeloid = c("SPI1","FCER1G","CSF1R"),
                        Tcells = c("CD3E", "CD3D", "CD3G", "CD7","CD8A"))

3 Performing Enrichment Calculation

Several popular methods exist for Gene Set Enrichment Analysis (GSEA). These methods can vary in the underlying assumptions. escape incorporates several methods that are particularly advantageous for single-cell RNA values:

3.0.1 ssGSEA

This method calculates the enrichment score using a rank-normalized approach and generating an empirical cumulative distribution function for each individual cell. The enrichment score is defined for a gene set (G) using the number of genes in the gene set (NG) and total number of genes (N).

\[ ES(G,S) \sum_{i = 1}^{n} [P_W^G(G,S,i)-P_{NG}(G,S,i)] \] Please see the following citation for more information.

3.0.2 GSVA

GSVA varies slightly by estimating a Poisson-based kernel cumulative density function. But like ssGSEA, the ultimate enrichment score reported is based on the maximum value of the random walk statistic. GSVA appears to have better overall consistency and runs faster than ssGSEA.

\[ ES_{jk}^{max} = V_{jk} [max_{l=1,...,p}|v_{jk}(l)] \] Please see the following citation for more information.

3.0.3 AUCell

In contrast to ssGSEA and GSVA, AUCell takes the gene rankings for each cell and step-wise plots the position of each gene in the gene set along the y-axis. The output score is the area under the curve for this approach.

Please see the following citation for more information.

3.0.4 UCell

UCell calculates a Mann-Whitney U statistic based on the gene rank list. Importantly, UCell has a cut-off for ranked genes (\[r_{max}\]) at 1500 - this is per design as drop-out in single-cell can alter enrichment results. This also substantially speeds the calculations up.

The enrichment score output is then calculated using the complement of the U statistic scaled by the gene set size and cut-off.

\[ U_j^` = 1-\frac{U_j}{n \bullet r_{max}} \]

Please see the following citation for more information.

3.1 escape.matrix

escape has 2 major functions - the first being escape.matrix(), which serves as the backbone of enrichment calculations. Using count-level data supplied from a single-cell object or matrix, escape.matrix() will produce an enrichment score for the individual cells with the gene sets selected and output the values as a matrix.

method

  • AUCell
  • GSVA
  • ssGSEA
  • UCell

groups

  • The number of cells to calculate at once.

min.size

  • The minimum size of detectable genes in a gene set. Gene sets less than the min.size will be removed before the calculation.

normalize

  • Use the number of genes from the gene sets in each cell to normalize the enrichment scores. The default value is FALSE.

make.positive

  • During normalization, whether to shift the enrichment values to a positive range (TRUE) or not (FALSE). The default value is FALSE.

Cautionary note: make.positive was added to allow for differential analysis downstream of enrichment as some methods may produce negative values. It preserves log-fold change, but ultimately modifies the enrichment values and should be used with caution.

enrichment.scores <- escape.matrix(pbmc_small, 
                                   gene.sets = GS.hallmark, 
                                   groups = 1000, 
                                   min.size = 5)
## [1] "Using sets of 1000 cells. Running 1 times."
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
ggplot(data = as.data.frame(enrichment.scores), 
      mapping = aes(enrichment.scores[,1], enrichment.scores[,2])) + 
  geom_point() + 
  theme_classic() + 
  theme(axis.title = element_blank())

Multi-core support is for all methods is available through BiocParallel. To add more cores, use the argument BPPARAM to escape.matrix(). Here we will use the SnowParam() for it’s support across platforms and explicitly call 2 workers (or cores).

enrichment.scores <- escape.matrix(pbmc_small, 
                                   gene.sets = GS.hallmark, 
                                   groups = 1000, 
                                   min.size = 5, 
                                   BPPARAM = SnowParam(workers = 2))

3.2 runEscape

Alternatively, we can use runEscape() to calculate the enrichment score and directly attach the output to a single-cell object. The additional parameter for ``runEscape is new.assay.name, in order to save the enrichment scores as a custom assay in the single-cell object.

pbmc_small <- runEscape(pbmc_small, 
                        method = "ssGSEA",
                        gene.sets = GS.hallmark, 
                        groups = 1000, 
                        min.size = 5,
                        new.assay.name = "escape.ssGSEA")
## [1] "Using sets of 1000 cells. Running 1 times."
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
sce.pbmc <- runEscape(sce.pbmc, 
                      method = "UCell",
                      gene.sets = GS.hallmark, 
                      groups = 1000, 
                      min.size = 5,
                      new.assay.name = "escape.UCell")
## [1] "Using sets of 1000 cells. Running 1 times."

We can quickly examine the attached enrichment scores using the visualization/workflow we prefer - here we will use just FeaturePlot() from the Seurat R package.

#Define color palette 
colorblind_vector <- hcl.colors(n=7, palette = "inferno", fixup = TRUE)

FeaturePlot(pbmc_small, "HALLMARK-APOPTOSIS") + 
  scale_color_gradientn(colors = colorblind_vector) + 
  theme(plot.title = element_blank())

3.3 performNormalization

Although we glossed over the normalization that can be used in escape.matrix() and runEscape(), it is worth mentioning here as normalization can affect all downstream analyses.

There can be inherent bias in enrichment values due to drop out in single-cell expression data. Cells with larger numbers of features and counts will likely have higher enrichment values. performNormalization() will normalize the enrichment values by calculating the number of genes expressed in each gene set and cell. This is similar to the normalization in classic GSEA and it will be stored in a new assay.

pbmc_small <- performNormalization(pbmc_small, 
                                   assay = "escape.ssGSEA", 
                                   gene.sets = GS.hallmark)
## [1] "Calculating features per cell..."
## [1] "Normalizing enrichment scores per cell..."

An alternative for scaling by expressed gene sets would be to use a scaling factor previously calculated during normal single-cell data processing and quality control. This can be done using the scale.factor argument and providing a vector.

pbmc_small <- performNormalization(pbmc_small, 
                                   assay = "escape.ssGSEA", 
                                   gene.sets = GS.hallmark, 
                                   scale.factor = pbmc_small$nFeature_RNA)
## [1] "Normalizing enrichment scores per cell..."

performNormalization() has an additional parameter make.positive. Across the individual gene sets, if negative normalized enrichment scores are seen, the minimum value is added to all values. For example if the normalized enrichment scores (after the above accounting for drop out) ranges from -50 to 50, make.positive will adjust the range to 0 to 100 (by adding 50). This allows for compatible log2-fold change downstream, but can alter the enrichment score interpretation.


4 Visualizations

There are a number of ways to look at the enrichment values downstream of runEscape() with the myriad plotting and visualizations functions/packages for single-cell data. escape include several additional plotting functions to assist in the analysis.

4.1 heatmapEnrichment

We can examine the enrichment values across our gene sets by using heatmapEnrichment(). This visualization will return the mean of the group.by variable. As a default - all visualizations of single-cell objects will use the cluster assignment or active identity as a default for visualizations.

heatmapEnrichment(pbmc_small, 
                  group.by = "ident",
                  gene.set.use = "all",
                  assay = "escape.ssGSEA")