1 Introduction

This document gives an introduction to and overview of the quality control functionality of the scater package. scater contains tools to help with the analysis of single-cell transcriptomic data, focusing on low-level steps such as quality control, normalization and visualization. It is based on the SingleCellExperiment class (from the SingleCellExperiment package), and thus is interoperable with many other Bioconductor packages such as scran, batchelor and iSEE.

Note: A more comprehensive description of the use of scater (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

2 Setting up the data

2.1 Generating a `SingleCellExperiment` object

We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.). First, we create a SingleCellExperiment object containing the data, as demonstrated below with a famous brain dataset. Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.

library(scRNAseq)
example_sce <- ZeiselBrainData()
example_sce

## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat

We usually expect (raw) count data to be labelled as "counts" in the assays, which can be easily retrieved with the counts accessor. Getters and setters are also provided for exprs, tpm, cpm, fpkm and versions of these with the prefix norm_.

str(counts(example_sce))

Row and column-level metadata are easily accessed (or modified) as shown below. There are also dedicated getters and setters for size factor values (sizeFactors()); reduced dimensionality results (reducedDim()); and alternative experimental features (altExp()).

example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)

## DataFrame with 3005 rows and 11 columns
##                        tissue   group # total mRNA mol      well       sex
##                   <character> <numeric>      <numeric> <numeric> <numeric>
## 1772071015_C02       sscortex         1           1221         3         3
## 1772071017_G12       sscortex         1           1231        95         1
## 1772071017_A05       sscortex         1           1652        27         1
## 1772071014_B06       sscortex         1           1696        37         3
## 1772067065_H06       sscortex         1           1219        43         3
## ...                       ...       ...            ...       ...       ...
## 1772067059_B04 ca1hippocampus         9           1997        19         1
## 1772066097_D04 ca1hippocampus         9           1415        21         1
## 1772063068_D01       sscortex         9           1876        34         3
## 1772066098_A12 ca1hippocampus         9           1546        88         1
## 1772058148_F03       sscortex         9           1970        15         3
##                      age  diameter        cell_id       level1class level2class
##                <numeric> <numeric>    <character>       <character> <character>
## 1772071015_C02         2         1 1772071015_C02      interneurons       Int10
## 1772071017_G12         1       353 1772071017_G12      interneurons       Int10
## 1772071017_A05         1        13 1772071017_A05      interneurons        Int6
## 1772071014_B06         2        19 1772071014_B06      interneurons       Int10
## 1772067065_H06         6        12 1772067065_H06      interneurons        Int9
## ...                  ...       ...            ...               ...         ...
## 1772067059_B04         4       382 1772067059_B04 endothelial-mural       Peric
## 1772066097_D04         7        12 1772066097_D04 endothelial-mural        Vsmc
## 1772063068_D01         7       268 1772063068_D01 endothelial-mural        Vsmc
## 1772066098_A12         7       324 1772066098_A12 endothelial-mural        Vsmc
## 1772058148_F03         7         6 1772058148_F03 endothelial-mural        Vsmc
##                       whee
##                <character>
## 1772071015_C02           B
## 1772071017_G12           F
## 1772071017_A05           A
## 1772071014_B06           H
## 1772067065_H06           X
## ...                    ...
## 1772067059_B04           P
## 1772066097_D04           T
## 1772063068_D01           H
## 1772066098_A12           K
## 1772058148_F03           E

rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)

## DataFrame with 20006 rows and 2 columns
##          featureType     stuff
##          <character> <numeric>
## Tspan12   endogenous  0.776738
## Tshz1     endogenous  0.531341
## Fnbp1l    endogenous  0.245747
## Adamts15  endogenous  0.841682
## Cldn12    endogenous  0.476325
## ...              ...       ...
## mt-Co2          mito  0.390712
## mt-Co1          mito  0.542127
## mt-Rnr2         mito  0.915390
## mt-Rnr1         mito  0.665484
## mt-Nd4l         mito  0.612729

Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. More details about the SingleCellExperiment class can be found in the documentation for SingleCellExperiment package.

2.2 Other methods of data import

Count matrices stored as CSV files or equivalent can be easily read into R session using read.table() from utils or fread() from the data.table package. It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment object.

For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package. This is performed using the readSparseCounts() function and reduces memory usage by not explicitly storing zeroes in memory.

Data from 10X Genomics experiments can be read in using the read10xCounts function from the DropletUtils package. This will automatically generate a SingleCellExperiment with a sparse matrix, see the documentation for more details.

Transcript abundances from the kallisto and Salmon pseudo-aligners can be imported using methods from the tximeta package. This produces a SummarizedExperiment object that can be coerced into a SingleCellExperiment simply with as(se, "SingleCellExperiment").

3 Quality control

3.1 Background

scater provides functionality for three levels of quality control (QC):

QC and filtering of cells
QC and filtering of features (genes)
QC of experimental variables

3.2 Cell-level QC

3.2.1 Definition of metrics

Cell-level metrics are computed by the perCellQCMetrics() function and include:

sum: total number of counts for the cell (i.e., the library size).
detected: the number of features for the cell that have counts above the detection limit (default of zero).
subsets_X_percent: percentage of all counts that come from the feature control set named X.

library(scater)
per.cell <- perCellQCMetrics(example_sce, 
    subsets=list(Mito=grep("mt-", rownames(example_sce))))
summary(per.cell$sum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2574    8130   12913   14954   19284   63505

summary(per.cell$detected)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     785    2484    3656    3777    4929    8167

summary(per.cell$subsets_Mito_percent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.992   6.653   7.956  10.290  56.955

It is often convenient to store this in the colData() of our SingleCellExperiment object for future reference. (In fact, the addPerCellQC() function will do this automatically.)

colData(example_sce) <- cbind(colData(example_sce), per.cell)

3.2.2 Diagnostic plots

Metadata variables can be plotted against each other using the plotColData() function, as shown below. We expect to see an increasing number of detected genes with increasing total count. Each point represents a cell that is coloured according to its tissue of origin.

plotColData(example_sce, x = "sum", y="detected", colour_by="tissue")

Here, we have plotted the total count for each cell against the mitochondrial content. Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. High percentage expression from feature controls and few expressed features are indicative of blank and failed cells. For some variety, we have faceted by the tissue of origin.

plotColData(example_sce, x = "sum", y="subsets_Mito_percent", 
    other_fields="tissue") + facet_wrap(~tissue)

3.2.3 Identifying low-quality cells

Column subsetting of the SingeCellExperiment object will only retain the selected cells, thus removing low-quality or otherwise unwanted cells. We can identify high-quality cells to retain by setting a fixed threshold on particular metrics. For example, we could retain only cells that have at least 100,000 total counts and at least 500 expressed features:

keep.total <- example_sce$sum > 1e5
keep.n <- example_sce$detected > 500
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)

## [1] 20006     0

The isOutlier function provides a more data-adaptive way of choosing these thresholds. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median. Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median.

keep.total <- isOutlier(per.cell$sum, type="lower", log=TRUE)
filtered <- example_sce[,keep.total]

Detection of outliers can be achieved more conveniently for several common metrics using the quickPerCellQC() function. This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value (e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason.

qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
colSums(as.matrix(qc.stats))

##              low_lib_size            low_n_features high_subsets_Mito_percent 
##                         0                         3                       128 
##                   discard 
##                       131

filtered <- example_sce[,!qc.stats$discard]

The isOutlier approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, cell type. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the simpleSingleCell workflow for more details.

3.3 Feature-level QC

3.3.1 Definition of metrics

Feature-level metrics are computed by the perFeatureQCMetrics() function and include:

mean: the mean count of the gene/feature across all cells.
detected: the percentage of cells with non-zero counts for each gene.
subsets_Y_ratio: ratio of mean counts between the cell control set named Y and all cells.

# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(example_sce, subsets=list(Empty=1:10))
summary(per.feat$mean)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0007   0.0097   0.1338   0.7475   0.5763 732.1524

summary(per.feat$detected)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.03328  0.76539  9.01830 18.87800 31.24792 99.96672

summary(per.feat$subsets_Empty_ratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.601   1.872   2.016 300.500

A more refined calculation of the average is provided by the calculateAverage() function, which adjusts the counts by the relative library size (or size factor) prior to taking the mean.

ave <- calculateAverage(example_sce)
summary(ave)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0002   0.0109   0.1443   0.7475   0.5674 850.6880

We can also compute the number of cells expressing a gene directly.

summary(nexprs(example_sce, byrow=TRUE))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    23.0   271.0   567.3   939.0  3004.0

3.3.2 Diagnostic plots

We look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell. The circle indicates the median expression of each gene, with which genes are sorted. By default, “expression” is defined using the feature counts (if available), but other expression values can be used instead by changing exprs_values.

plotHighestExprs(example_sce, exprs_values = "counts")

We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment.

3.3.3 Subsetting by row

Genes can be removed by row subsetting of the SingleCellExperiment object. For example, we can filter out features (genes) that are not expressed in any cells:

keep_feature <- nexprs(example_sce, byrow=TRUE) > 0
example_sce <- example_sce[keep_feature,]
dim(example_sce)

## [1] 20006  3005

Other filtering can be done using existing annotation. For example, ribosomal protein genes and predicted genes can be identified (and removed) using regular expressions or biotype information. Such genes are often uninteresting when the aim is to characterize population heterogeneity.

3.4 Variable-level QC

Variable-level metrics are computed by the getVarianceExplained() function (after normalization, see below). This calculates the percentage of variance of each gene’s expression that is explained by each variable in the colData of the SingleCellExperiment object.

example_sce <- logNormCounts(example_sce) # see below.
vars <- getVarianceExplained(example_sce, 
    variables=c("tissue", "total mRNA mol", "sex", "age"))
head(vars)

##              tissue total mRNA mol         sex        age
## Tspan12  0.02207262    0.074086504 0.146344996 0.09472155
## Tshz1    3.36083014    0.003846487 0.001079356 0.31262288
## Fnbp1l   0.43597185    0.421086301 0.003071630 0.64964174
## Adamts15 0.54233888    0.005348505 0.030821621 0.01393787
## Cldn12   0.03506751    0.309128294 0.008341408 0.02363737
## Rxfp1    0.18559637    0.016290703 0.055646799 0.02128006

We can then use this to determine which experimental factors are contributing most to the variance in expression. This is useful for diagnosing batch effects or to quickly verify that a treatment has an effect.

plotExplanatoryVariables(vars)

4 Computing expression values

4.1 Normalization for library size differences

The most commonly used function is logNormCounts(), which calculates log₂-transformed normalized expression values. This is done by dividing each count by its size factor, adding a pseudo-count and log-transforming. The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in "logcounts".

example_sce <- logNormCounts(example_sce)
assayNames(example_sce)

## [1] "counts"    "logcounts"

By default, the size factor is automatically computed from the library size of each cell using the librarySizeFactors() function. This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells. However, if size factors are explicitly provided in the SingleCellExperiment, they will be used by the normalization functions.

summary(librarySizeFactors(example_sce))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1721  0.5437  0.8635  1.0000  1.2895  4.2466

Alternatively, we can calculate counts-per-million using the aptly-named calculateCPM() function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object. Related functions include calculateTPM() and calculateFPKM(), which do pretty much as advertised.

cpm(example_sce) <- calculateCPM(example_sce)

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.

assay(example_sce, "normed") <- normalizeCounts(example_sce, 
    size_factors=runif(ncol(example_sce)), pseudo_count=1.5)

4.2 Aggregation across groups or clusters

The aggregateAcrossCells() function is helpful for aggregating expression values across groups of cells. For example, we might wish to sum together counts for all cells in the same cluster, possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with edgeR). This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.

agg_sce <- aggregateAcrossCells(example_sce, ids=example_sce$level1class)
head(assay(agg_sce))

##          astrocytes_ependymal endothelial-mural interneurons microglia
## Tspan12                    74               138          186        14
## Tshz1                      96               105          335        43
## Fnbp1l                     89               169          689        33
## Adamts15                    2                23           64         0
## Cldn12                     50                27          160         5
## Rxfp1                       0                 3           74         0
##          oligodendrocytes pyramidal CA1 pyramidal SS
## Tspan12                99           269           58
## Tshz1                 297            65          332
## Fnbp1l                239           859         1056
## Adamts15               19            11           15
## Cldn12                214           411          273
## Rxfp1                  13            48           30

colData(agg_sce)[,c("ids", "ncells")]

## DataFrame with 7 rows and 2 columns
##                                       ids    ncells
##                               <character> <integer>
## astrocytes_ependymal astrocytes_ependymal       224
## endothelial-mural       endothelial-mural       235
## interneurons                 interneurons       290
## microglia                       microglia        98
## oligodendrocytes         oligodendrocytes       820
## pyramidal CA1               pyramidal CA1       939
## pyramidal SS                 pyramidal SS       399

It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors.

agg_sce <- aggregateAcrossCells(example_sce, 
    ids=colData(example_sce)[,c("level1class", "tissue")])
head(assay(agg_sce))

##          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## Tspan12    23   51   10  128   52  134    0   14   10    89   269    58
## Tshz1      38   58   10   95   90  245    1   42   50   247    65   332
## Fnbp1l     23   66    9  160  227  462    3   30   18   221   859  1056
## Adamts15    0    2    2   21   15   49    0    0    0    19    11    15
## Cldn12      5   45    0   27   61   99    0    5   22   192   411   273
## Rxfp1       0    0    0    3    3   71    0    0    1    12    48    30

colData(agg_sce)[,c("level1class", "tissue", "ncells")]

## DataFrame with 12 rows and 3 columns
##              level1class         tissue    ncells
##              <character>    <character> <integer>
## 1   astrocytes_ependymal ca1hippocampus        81
## 2   astrocytes_ependymal       sscortex       143
## 3      endothelial-mural ca1hippocampus        33
## 4      endothelial-mural       sscortex       202
## 5           interneurons ca1hippocampus       126
## ...                  ...            ...       ...
## 8              microglia       sscortex        84
## 9       oligodendrocytes ca1hippocampus       121
## 10      oligodendrocytes       sscortex       699
## 11         pyramidal CA1 ca1hippocampus       939
## 12          pyramidal SS       sscortex       399

Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway. Given a list of gene sets, we can use the sumCountsAcrossFeatures() function to aggregate expression values across features. This is usually best done by averaging the log-expression values as shown below.

agg_feat <- sumCountsAcrossFeatures(example_sce,
    ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100),
    average=TRUE, exprs_values="logcounts")
agg_feat[,1:10]

##          1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
## GeneSet1      0.7717764      0.2177634      0.7577241      0.5380195
## GeneSet2      1.0602489      0.9955551      1.2558629      1.1482845
## GeneSet3      1.1410273      1.0018648      1.2727650      1.1995992
##          1772067065_H06 1772071017_E02 1772067065_B07 1772067060_B09
## GeneSet1      0.4987205      0.3528216      0.5262005      0.0749138
## GeneSet2      1.2010854      1.2855797      1.2889492      1.1806065
## GeneSet3      1.1244981      1.1567887      1.1057770      1.1353704
##          1772071014_E04 1772071015_D04
## GeneSet1      0.7267285      0.7924899
## GeneSet2      1.1728326      1.2871506
## GeneSet3      1.0098231      1.1846439

Similar functions are available to compute the number or proportion of cells with detectable expression in each group.

4.3 Visualizing expression values

The plotExpression() function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful for further examination of features identified from differential expression testing, pseudotime analysis or other analyses. By default, it uses expression values in the "logcounts" assay, but this can be changed through the exprs_values argument.

plotExpression(example_sce, rownames(example_sce)[1:6], x = "level1class")

Setting x will determine the covariate to be shown on the x-axis. This can be a field in the column metadata or the name of a feature (to obtain the expression profile across cells). Categorical covariates will yield grouped violins as shown above, with one panel per feature. By comparison, continuous covariates will generate a scatter plot in each panel, as shown below.

plotExpression(example_sce, rownames(example_sce)[1:6],
    x = rownames(example_sce)[10])

The points can also be coloured, shaped or resized by the column metadata or expression values.

plotExpression(example_sce, rownames(example_sce)[1:6],
    x = "level1class", colour_by="tissue")

Directly plotting the gene expression without any x or other visual parameters will generate a set of grouped violin plots, coloured in an aesthetically pleasing manner.

plotExpression(example_sce, rownames(example_sce)[1:6])

5 Dimensionality reduction

5.1 Principal components analysis

Principal components analysis (PCA) is often performed to denoise and compact the data prior to downstream analyses. The runPCA() function provides a simple wrapper around the base machinery in BiocSingular for computing PCs from log-transformed expression values. This stores the output in the reducedDims slot of the SingleCellExperiment, which can be easily retrieved (along with the percentage of variance explained by each PC) as shown below:

example_sce <- runPCA(example_sce)
str(reducedDim(example_sce, "PCA"))

##  num [1:3005, 1:50] 15.4 15 17.2 16.9 18.4 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
##   ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:50] 39.72 9.38 4.25 3.9 2.76 ...
##  - attr(*, "rotation")= num [1:500, 1:50] -0.1471 -0.1146 -0.1084 -0.0958 -0.0953 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:500] "Plp1" "Trf" "Mal" "Apod" ...
##   .. ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...

By default, runPCA() uses the top 500 genes with the highest variances to compute the first PCs. This can be tuned by specifying subset_row to pass in an explicit set of genes of interest, and by using ncomponents to determine the number of components to compute. The name argument can also be used to change the name of the result in the reducedDims slot.

example_sce <- runPCA(example_sce, name="PCA2",
    subset_row=rownames(example_sce)[1:1000],
    ncomponents=25)
str(reducedDim(example_sce, "PCA2"))

##  num [1:3005, 1:25] 20 21 23 23.7 21.5 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
##   ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:25] 22.3 5.11 3.42 1.69 1.58 ...
##  - attr(*, "rotation")= num [1:1000, 1:25] 2.24e-04 8.52e-05 2.43e-02 5.92e-04 6.35e-03 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:1000] "Tspan12" "Tshz1" "Fnbp1l" "Adamts15" ...
##   .. ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...

5.2 Other dimensionality reduction methods

$t$-distributed stochastic neighbour embedding ($t$-SNE) is widely used for visualizing complex single-cell data sets. The same procedure described for PCA plots can be applied to generate $t$-SNE plots using plotTSNE, with coordinates obtained using runTSNE via the Rtsne package. We strongly recommend generating plots with different random seeds and perplexity values, to ensure that any conclusions are robus t to different visualizations.

# Perplexity of 10 just chosen here arbitrarily.
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=10)
head(reducedDim(example_sce, "TSNE"))

##                     [,1]      [,2]
## 1772071015_C02 -49.70928 -12.34376
## 1772071017_G12 -52.60643 -10.83418
## 1772071017_A05 -49.34541 -12.32723
## 1772071014_B06 -52.17084 -10.59969
## 1772067065_H06 -51.75725 -10.20202
## 1772071017_E02 -52.71614 -10.29159

A more common pattern involves using the pre-existing PCA results as input into the $t$-SNE algorithm. This is useful as it improves speed by using a low-rank approximation of the expression matrix; and reduces random noise, by focusing on the major factors of variation. The code below uses the first 10 dimensions of the previously computed PCA result to perform the $t$-SNE.

set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=50, 
    dimred="PCA", n_dimred=10)
head(reducedDim(example_sce, "TSNE"))

##                     [,1]       [,2]
## 1772071015_C02 -38.05331  0.9025655
## 1772071017_G12 -36.02916 -0.1065038
## 1772071017_A05 -38.45340  1.3981213
## 1772071014_B06 -38.64621 -0.2184595
## 1772067065_H06 -41.02368 -0.8125667
## 1772071017_E02 -38.60500  2.6343767

The same can be done for uniform manifold with approximate projection (UMAP) via the runUMAP() function, itself based on the uwot package.

example_sce <- runUMAP(example_sce)
head(reducedDim(example_sce, "UMAP"))

##                     [,1]      [,2]
## 1772071015_C02 -12.04324 -2.072196
## 1772071017_G12 -12.11942 -2.135925
## 1772071017_A05 -11.93046 -2.005134
## 1772071014_B06 -12.08696 -2.123133
## 1772067065_H06 -12.11039 -2.148298
## 1772071017_E02 -12.10755 -2.138869

5.3 Visualizing reduced dimensions

Any dimensionality reduction result can be plotted using the plotReducedDim function. Here, each point represents a cell and is coloured according to its cell type label.

plotReducedDim(example_sce, dimred = "PCA", colour_by = "level1class")

Some result types have dedicated wrappers for convenience, e.g., plotTSNE() for $t$-SNE results:

plotTSNE(example_sce, colour_by = "Snap25")

The dedicated plotPCA() function also adds the percentage of variance explained to the axes:

plotPCA(example_sce, colour_by="Mog")

Multiple components can be plotted in a series of pairwise plots. When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component.

example_sce <- runPCA(example_sce, ncomponents=20)
plotPCA(example_sce, ncomponents = 4, colour_by = "level1class")

We separate the execution of these functions from the plotting to enable the same coordinates to be re-used across multiple plots. This avoids repeatedly recomputing those coordinates just to change an aesthetic across plots.

6 Utilities for custom visualization

We provide some helper functions to easily convert from a SingleCellExperiment object to a data.frame for use in, say, ggplot2 functions. This allows users to create highly customized plots that are not covered by the existing scater functions. The ggcells() function will intelligently retrieve fields from the colData(), assays(), altExps() or reducedDims() to create a single data.frame for immediate use. In the example below, we create boxplots of Snap25 expression stratified by cell type and tissue of origin:

ggcells(example_sce, mapping=aes(x=level1class, y=Snap25)) + 
    geom_boxplot() +
    facet_wrap(~tissue)

Reduced dimension results are easily pulled in to create customized equivalents of the plotReducedDim() output. In this example, we create a $t$-SNE plot faceted by tissue and coloured by Snap25 expression:

ggcells(example_sce, mapping=aes(x=TSNE.1, y=TSNE.2, colour=Snap25)) +
    geom_point() +
    stat_density_2d() +
    facet_wrap(~tissue) +
    scale_colour_distiller(direction=1)

It is also straightforward to examine the relationship between the size factors on the normalized gene expression:

ggcells(example_sce, mapping=aes(x=sizeFactor, y=Actb)) +
    geom_point() +
    geom_smooth()

Similar operations can be performed on the features using the ggfeatures() function, which will retrieve values either from rowData or from the columns of the assays. Here, we examine the relationship between the feature type and the expression within a given cell; note the renaming of the otherwise syntactically invalid cell name.

colnames(example_sce) <- make.names(colnames(example_sce))
ggfeatures(example_sce, mapping=aes(x=featureType, y=X1772062111_E06)) + 
    geom_violin()

7 Transitioning from the `SCESet` class

As of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class. From Bioconductor 3.6 (October 2017), the release version of scater will use SingleCellExperiment. SingleCellExperiment is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.

It should be straight-forward to convert existing scripts based on SCESet objects to SingleCellExperiment objects, with key changes outlined immediately below.

The functions toSingleCellExperiment and updateSCESet (for backwards compatibility) can be used to convert an old SCESet object to a SingleCellExperiment object;
Create a new SingleCellExperiment object with the function SingleCellExperiment (actually less fiddly than creating a new SCESet);
scater functions have been refactored to take SingleCellExperiment objects, so once data is in a SingleCellExperiment object, the user experience is almost identical to that with the SCESet class.

Users may need to be aware of the following when updating their own scripts:

Cell names can now be accessed/assigned with the colnames function (instead of sampleNames or cellNames for an SCESet object);
Feature (gene/transcript) names should now be accessed/assigned with the rownames function (instead of featureNames);
Cell metadata, stored as phenoData in an SCESet, corresponds to colData in a SingleCellExperiment object and is accessed/assigned with the colData function (this replaces the pData function);
Individual cell-level variables can still be accessed with the $ operator (e.g. sce$sum);
Feature metadata, stored as featureData in an SCESet, corresponds to rowData in a SingleCellExperiment object and is accessed/assigned with the rowData function (this replaces the fData function);
plotScater, which produces a cumulative expression, overview plot, replaces the generic plot function for SCESet objects.

Session information

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.16.2               ggplot2_3.3.2              
##  [3] scRNAseq_2.2.0              SingleCellExperiment_1.10.1
##  [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
##  [7] matrixStats_0.56.0          Biobase_2.48.0             
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
## [11] IRanges_2.22.2              S4Vectors_0.26.1           
## [13] BiocGenerics_0.34.0         BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-148                  bitops_1.0-6                 
##  [3] bit64_0.9-7                   RColorBrewer_1.1-2           
##  [5] httr_1.4.1                    tools_4.0.0                  
##  [7] R6_2.4.1                      irlba_2.3.3                  
##  [9] vipor_0.4.5                   mgcv_1.8-31                  
## [11] uwot_0.1.8                    DBI_1.1.0                    
## [13] colorspace_1.4-1              withr_2.2.0                  
## [15] tidyselect_1.1.0              gridExtra_2.3                
## [17] bit_1.1-15.2                  curl_4.3                     
## [19] compiler_4.0.0                BiocNeighbors_1.6.0          
## [21] isoband_0.2.2                 labeling_0.3                 
## [23] bookdown_0.20                 scales_1.1.1                 
## [25] rappdirs_0.3.1                stringr_1.4.0                
## [27] digest_0.6.25                 rmarkdown_2.3                
## [29] XVector_0.28.0                pkgconfig_2.0.3              
## [31] htmltools_0.5.0               dbplyr_1.4.4                 
## [33] fastmap_1.0.1                 rlang_0.4.6                  
## [35] RSQLite_2.2.0                 FNN_1.1.3                    
## [37] shiny_1.5.0                   DelayedMatrixStats_1.10.0    
## [39] generics_0.0.2                farver_2.0.3                 
## [41] BiocParallel_1.22.0           dplyr_1.0.0                  
## [43] RCurl_1.98-1.2                magrittr_1.5                 
## [45] BiocSingular_1.4.0            GenomeInfoDbData_1.2.3       
## [47] Matrix_1.2-18                 Rcpp_1.0.4.6                 
## [49] ggbeeswarm_0.6.0              munsell_0.5.0                
## [51] viridis_0.5.1                 lifecycle_0.2.0              
## [53] stringi_1.4.6                 yaml_2.2.1                   
## [55] MASS_7.3-51.6                 zlibbioc_1.34.0              
## [57] Rtsne_0.15                    BiocFileCache_1.12.0         
## [59] AnnotationHub_2.20.0          grid_4.0.0                   
## [61] blob_1.2.1                    promises_1.1.1               
## [63] ExperimentHub_1.14.0          crayon_1.3.4                 
## [65] lattice_0.20-41               splines_4.0.0                
## [67] cowplot_1.0.0                 magick_2.4.0                 
## [69] knitr_1.29                    pillar_1.4.4                 
## [71] glue_1.4.1                    BiocVersion_3.11.1           
## [73] evaluate_0.14                 BiocManager_1.30.10          
## [75] vctrs_0.3.1                   httpuv_1.5.4                 
## [77] gtable_0.3.0                  purrr_0.3.4                  
## [79] assertthat_0.2.1              xfun_0.15                    
## [81] rsvd_1.0.3                    mime_0.9                     
## [83] xtable_1.8-4                  RSpectra_0.16-0              
## [85] later_1.1.0.1                 viridisLite_0.3.0            
## [87] tibble_3.0.1                  AnnotationDbi_1.50.0         
## [89] beeswarm_0.2.3                memoise_1.1.0                
## [91] ellipsis_0.3.1                interactiveDisplayBase_1.26.3

Single-cell analysis toolkit for expression in R

Revised: February 4, 2020

Package

1 Introduction

2 Setting up the data

2.1 Generating a `SingleCellExperiment` object

2.2 Other methods of data import

3 Quality control

3.1 Background

3.2 Cell-level QC

3.2.1 Definition of metrics

3.2.2 Diagnostic plots

3.2.3 Identifying low-quality cells

3.3 Feature-level QC

3.3.1 Definition of metrics

3.3.2 Diagnostic plots

3.3.3 Subsetting by row

3.4 Variable-level QC

4 Computing expression values

4.1 Normalization for library size differences

4.2 Aggregation across groups or clusters

4.3 Visualizing expression values

5 Dimensionality reduction

5.1 Principal components analysis

5.2 Other dimensionality reduction methods

5.3 Visualizing reduced dimensions

6 Utilities for custom visualization

7 Transitioning from the `SCESet` class

Session information

Single-cell analysis toolkit for expression in R

Revised: February 4, 2020

Package

1 Introduction

2 Setting up the data

2.1 Generating a SingleCellExperiment object

2.2 Other methods of data import

3 Quality control

3.1 Background

3.2 Cell-level QC

3.2.1 Definition of metrics

3.2.2 Diagnostic plots

3.2.3 Identifying low-quality cells

3.3 Feature-level QC

3.3.1 Definition of metrics

3.3.2 Diagnostic plots

3.3.3 Subsetting by row

3.4 Variable-level QC

4 Computing expression values

4.1 Normalization for library size differences

4.2 Aggregation across groups or clusters

4.3 Visualizing expression values

5 Dimensionality reduction

5.1 Principal components analysis

5.2 Other dimensionality reduction methods

5.3 Visualizing reduced dimensions

6 Utilities for custom visualization

7 Transitioning from the SCESet class

Session information

2.1 Generating a `SingleCellExperiment` object

7 Transitioning from the `SCESet` class