Contents

1 Introduction

“APL” is a package developed for computation of Association Plots, a method for visualization and analysis of single cell transcriptomics data. The main focus of “APL” is the identification of genes characteristic for individual clusters of cells from input data.

When working with APL package please cite:

Gralinska, E., Kohl, C., Fadakar, B. S., & Vingron, M. (2022).
Visualizing Cluster-specific Genes from Single-cell Transcriptomics Data Using Association Plots.
Journal of Molecular Biology, 434(11), 167525.

A citation can also be obtained in R by running citation("APL"). For a mathematical description of the method, please refer to the manuscript.

2 Installation

To install the APL from Bioconductor, run:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("APL")

Alternatively the package can also be installed from GitHub:

library(devtools)
install_github("VingronLab/APL")

To additionally build the package vignette, run instead

install_github("VingronLab/APL", build_vignettes = TRUE, dependencies = TRUE)

Building the vignette will however take considerable time.

2.1 Changes regarding python dependencies

Previous versions of APL used pytorch SVD to speed up the computation of the full SVD. This has been deprecated in favor of fast truncated SVD implementations starting with Version 1.10.1. Calling runAPL or cacomp with python = TRUE will not lead to an error, but only issue a warning. If you still want to perform a full SVD, set the dimensions to rank of the matrix. Until a faster replacement is identified, this computation will be performed by the rather slow base R svd and should therefore not be done on very large matrices. The default number of dimensions now defaults to half of the rank of the matrix.

3 Preprocessing

3.1 Setup

In this vignette we will use a small data set published by Darmanis et al. (2015) consisting of 466 human adult cortical single cells sequenced on the Fluidigm platform as an example. To obtain the data necessary to follow the vignette we use the Bioconductor package scRNAseq.

Besides the package APL we will use Bioconductor packages to preprocess the data. Namely we will use SingleCellExperiment, scater and scran. However, the preprocessing could equally be performed with the single-cell RNA-seq analysis suite Seurat.

The preprocessing steps are performed according to the recommendations published in Orchestrating Single-Cell Analysis with Bioconductor by Amezquita et al. (2022). For more information about the rational behind them please refer to the book.

library(APL)
library(scRNAseq)
library(SingleCellExperiment)
library(scran)
library(scater)
set.seed(1234)

3.2 Loading the data

We start with the loading and preprocessing of the Darmanis data.

darmanis <- DarmanisBrainData()
darmanis
#> class: SingleCellExperiment 
#> dim: 22085 466 
#> metadata(0):
#> assays(1): counts
#> rownames(22085): 1/2-SBSRNA4 A1BG ... ZZZ3 tAKR
#> rowData names(0):
#> colnames(466): GSM1657871 GSM1657872 ... GSM1658365 GSM1658366
#> colData names(6): metrics age ... experiment_sample_name tissue
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

3.3 Normalization, PCA & Clustering

Association Plots from APL should be computed based on the normalized expression data. Therefore, we first normalize the counts from the Darmanis data and calculate both PCA and UMAP for visualizations later.

For now, APL requires the data to be clustered beforehand. The darmanis data comes already annotated, so we will use the cell types stored in the cell.type metadata column instead of performing a clustering.

set.seed(100)
clust <- quickCluster(darmanis)
darmanis <- computeSumFactors(darmanis, cluster = clust, min.mean = 0.1)
darmanis <- logNormCounts(darmanis)

dec <- modelGeneVar(darmanis)
top_darmanis <- getTopHVGs(dec, n = 5000)
darmanis <- fixedPCA(darmanis, subset.row = top_darmanis)
darmanis <- runUMAP(darmanis, dimred = "PCA")

plotReducedDim(darmanis, dimred = "UMAP", colour_by = "cell.type")

4 Quick start

The fastest way to compute the Association Plot for a selected cluster of cells from the input data is by using a wrapper function runAPL(). runAPL() automates most of the analysis steps for ease of use.

For example, to generate an Association Plot for the oligodendrocytes we can use the following command:

runAPL(
  darmanis,
  assay = "logcounts",
  top = 5000,
  group = which(darmanis$cell.type == "oligodendrocytes"),
  type = "ggplot"
)
#> Warning in rm_zeros(obj): Matrix contains rows with only 0s. These rows were
#> removed. If undesired set rm_zeros = FALSE.
#> No dimensions specified. Setting dimensions to: 93
#> Warning in rm_zeros(mat): Matrix contains rows with only 0s. These rows were
#> removed. If undesired set rm_zeros = FALSE.
#> 
#>  Using 13 dimensions. Subsetting.

#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |======================================================================| 100%

The generated Association Plot is computed based on the log-normalized count matrix. By default runAPL uses the top 5,000 most variable genes in the data, but the data can be subset to any number of genes by changing the value for the argument top. The dimensionality of the CA is determined automatically by the elbow rule described below (see here). This default behavior can be overriden by setting the dimensions manually (parameter dims). The cluster-specificity score (\(S_\alpha\)) for each gene is also calculated (score = TRUE). In order to better explore the data, type can be set to "plotly" to obtain an interactive plot. runAPL has many arguments to further customize the output and fine tune the calculations. Please refer to the documentation (?runAPL) for more information. The following sections in this vignette will discuss the choice of dimensionality and the \(S_\alpha\)-score.

5 Step-by-step way of computing Association Plots

Alternatively, Association Plots can be computed step-by-step. This allows to adjust the Association Plots to user’s needs. Below we explain each step of the process of generating Association Plots.

5.1 Correspondence Analysis

The first step of Association Plot computations is correspondence analysis (CA). CA is a data dimensionality reduction method similar to PCA, however it allows for a simultaneous embedding of both cells and genes from the input data in the same space. In this example we perform CA on the log-normalized count matrix of the darmanis brain data.

# Computing CA on logcounts
logcounts <- logcounts(darmanis)
ca <- cacomp(
  obj = logcounts,
  top = 5000
)
#> Warning in rm_zeros(obj): Matrix contains rows with only 0s. These rows were
#> removed. If undesired set rm_zeros = FALSE.
#> No dimensions specified. Setting dimensions to: 93

# The above is equivalent to:
# ca <- cacomp(obj = darmanis,
#              assay = "logcounts",
#              top = 5000)

The function cacomp accepts as an input any matrix with non-negative entries, be it a single-cell RNA-seq, bulk RNA-seq or other data. For ease of use, cacomp accepts also SingleCellExperiment and Seurat objects, however for these we additionally have to specify via the assay and/or slot (for Seurat) parameter from where to extract the data. Importantly, in order to ensure the interpretability of the results cacomp (and related functions such as runAPL) requires that the input matrix contains both row and column names.

When performing a feature selection before CA, we can set the argument top to the desired number of genes with the highest variance across cells from the input data to retain for further analysis. By default, only the top 5,000 most variable genes are kept as a good compromise between computational time and keeping the most relevant genes. If we want to ensure however that even marker genes of smaller clusters are kept, we can increase the number of genes.

The output of cacomp is an object of class cacomp:

ca
#> cacomp object with 466 columns, 5000 rows and 93 dimensions.
#> Calc. standard coord.:  std_coords_rows, std_coords_cols
#> Calc. principal coord.: prin_coords_rows, prin_coords_cols
#> Calc. APL coord.:       
#> Explained inertia:      3.1% Dim1, 2.2% Dim2

As can be seen in the summarized output, by default both types of coordinates in the CA space (principal and standardized) are calculated. Once the coordinates for the Association Plot are calculated, they will also be shown in the output of cacomp. Slots are accessed through an accessor function:

cacomp_slot(ca, "std_coords_cols")[1:5, 1:5]
#>                  Dim1      Dim2       Dim3        Dim4       Dim5
#> GSM1657871  0.1067086 0.1253297 -0.5046485 -0.32461780 -2.4262520
#> GSM1657872 -1.5101357 0.4299418  0.1219273  0.07942823 -0.2362747
#> GSM1657873  0.2237619 0.2610148 -0.1955599  0.16963578 -2.1477040
#> GSM1657874  0.5680872 0.1138251 -0.4725071  0.04409559  0.6708562
#> GSM1657875  0.4739344 0.2505648 -0.4384626  0.10316896 -2.8910899

In the case of SingleCellExperiment and Seurat objects, we can alternatively set return_input = TRUE to get the input object back, with the CA results computed by “APL” and stored in the appropriate slot for dimension reduction. This also allows for using the plotting functions that come with these packages:

darmanis <- cacomp(
  obj = darmanis,
  assay = "logcounts",
  top = 5000,
  return_input = TRUE
)
#> Warning in rm_zeros(obj): Matrix contains rows with only 0s. These rows were
#> removed. If undesired set rm_zeros = FALSE.
#> No dimensions specified. Setting dimensions to: 93

plotReducedDim(darmanis,
  dimred = "CA",
  ncomponents = c(1, 2),
  colour_by = "cell.type"
)