Contents

1 Introduction

As more and more scRNA-seq datasets become available, carrying out comparisons between them is key. A central application is to compare datasets of similar biological origin collected by different labs to ensure that the annotation and the analysis is consistent. Moreover, as very large references, e.g. the Human Cell Atlas (HCA), become available, an important application will be to project cells from a new sample (e.g. from a disease tissue) onto the reference to characterize differences in composition, or to detect new cell-types.

scmap is a method for projecting cells from a scRNA-seq experiment on to the cell-types or cells identified in a different experiment. A copy of the scmap manuscript is available on bioRxiv.

2 SingleCellExperiment class

scmap is built on top of the Bioconductor’s SingleCellExperiment class. Please read corresponding vignettes on how to create a SingleCellExperiment from your own data. Here we will show a small example on how to do that but note that it is not a comprehensive guide.

3 scmap input

If you already have a SingleCellExperiment object, then proceed to the next chapter.

If you have a matrix or a data frame containing expression data then you first need to create an SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scmap. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryo. The authors (Yan et al.) have defined developmental stages of all cells in the original publication (ann data frame). We will use these stages in projection later.

library(SingleCellExperiment)
library(scmap)

head(ann)
##                 cell_type1
## Oocyte..1.RPKM.     zygote
## Oocyte..2.RPKM.     zygote
## Oocyte..3.RPKM.     zygote
## Zygote..1.RPKM.     zygote
## Zygote..2.RPKM.     zygote
## Zygote..3.RPKM.     zygote
yan[1:3, 1:3]
##          Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152             0.0             0.0             0.0
## RPS11             1219.9          1021.1           931.6
## ELMO2                7.0            12.2             9.3

Note that the cell type information has to be stored in the cell_type1 column of the rowData slot of the SingleCellExperiment object.

Now let’s create a SingleCellExperiment object of the yan dataset:

sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
## Warning: 'isSpike<-' is deprecated.
## Use 'isSpike<-' instead.
## See help("Deprecated")
## Warning: 'spikeNames' is deprecated.
## See help("Deprecated")
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment 
## dim: 20214 90 
## metadata(0):
## assays(2): normcounts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
##   Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
## altExpNames(0):

4 Feature selection

Once we have a SingleCellExperiment object we can run scmap. Firstly, we need to select the most informative features (genes) from our input dataset:

sce <- selectFeatures(sce, suppress_plot = FALSE)
## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")
## Warning: 'spikeNames' is deprecated.
## See help("Deprecated")
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")

Features highlighted with the red colour will be used in the futher analysis (projection).

Features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects \(500\) features (it can also be controlled by setting n_features parameter):

table(rowData(sce)$scmap_features)
## 
## FALSE  TRUE 
## 19714   500

5 scmap-cluster

5.1 Index

The scmap-cluster index of a reference dataset is created by finding the median gene expression for each cluster. By default scmap uses the cell_type1 column of the colData slot in the reference to identify clusters. Other columns can be manually selected by adjusting cluster_col parameter:

sce <- indexCluster(sce)

The function indexCluster automatically writes the scmap_cluster_index item of the metadata slot of the reference dataset.

head(metadata(sce)$scmap_cluster_index)
##           zygote     2cell    4cell     8cell   16cell    blast
## ABCB4   5.788589 6.2258580 5.935134 0.6667119 0.000000 0.000000
## ABCC6P1 7.863625 7.7303559 8.322769 7.4303689 4.759867 0.000000
## ABT1    0.320773 0.1315172 0.000000 5.9787977 6.100671 4.627798
## ACCSL   7.922318 8.4274290 9.662611 4.5869260 1.768026 0.000000
## ACOT11  0.000000 0.0000000 0.000000 6.4677243 7.147798 4.057444
## ACOT9   4.877394 4.2196038 5.446969 4.0685468 3.827819 0.000000

One can also visualise the index:

heatmap(as.matrix(metadata(sce)$scmap_cluster_index))