1 Introduction

SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference set. Specifically, for each test cell:

We compute the Spearman correlation between its expression profile and that of each reference sample. This is done across the union of marker genes identified between all pairs of labels.
We define the per-label score as a fixed quantile (by default, 0.8) of the distribution of correlations.
We repeat this for all labels and we take the label with the highest score as the annotation for this cell.
We optionally perform a fine-tuning step:

The reference dataset is subsetted to only include labels with scores close to the maximum.
Scores are recomputed using only marker genes for the subset of labels.
This is iterated until one label remains.
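The scoring steps above can be sketched in base R on simulated data. This is only an illustration of the idea, not SingleR's actual implementation: the two labels, the simulated profiles and the noise level are all made up, every gene is used rather than a set of marker genes, and the fine-tuning step is omitted.

```r
# Toy data: two hypothetical labels, each with its own expression profile.
set.seed(100)
n.genes <- 50
profile <- list(A = rnorm(n.genes), B = rnorm(n.genes))
ref.labels <- rep(c("A", "B"), each = 3)

# Each reference sample is its label's profile plus noise (genes in rows).
ref <- sapply(ref.labels, function(l) profile[[l]] + rnorm(n.genes, sd = 0.3))

# A test cell simulated from label A.
test.cell <- profile$A + rnorm(n.genes, sd = 0.3)

# Step 1: Spearman correlation with every reference sample.
rho <- as.vector(cor(test.cell, ref, method = "spearman"))

# Step 2: per-label score as the 0.8 quantile of that label's correlations.
scores <- vapply(split(rho, ref.labels), quantile, numeric(1),
    probs = 0.8, names = FALSE)

# Step 3: the highest-scoring label becomes the annotation.
best <- names(scores)[which.max(scores)]
best
```

Using a quantile rather than the maximum or mean makes the score less sensitive to a single unusually similar or dissimilar reference sample within a label.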

Automatic annotation provides a convenient way of transferring biological knowledge across datasets. The work of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can then be propagated to new datasets in an automated manner.

2 Using the built-in references

SingleR provides several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated data retrieval functions. For example, we obtain reference data from the Human Primary Cell Atlas using the HumanPrimaryCellAtlasData() function, which returns a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels.

library(SingleR)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se

## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(2): label.main label.fine

Our test dataset is taken from La Manno et al. (2016).
For the sake of speed, we will only label the first 100 cells from this dataset.

library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

# SingleR() expects log-counts, but the function will also happily take raw
# counts for the test dataset. The reference, however, must have log-values.
library(scater)
hESCs <- logNormCounts(hESCs)

We use our hpca.se reference to annotate each cell in hESCs via the SingleR() function, which uses the algorithm described above. Note that the default marker detection method is to take the genes with the largest positive log-fold changes in the per-label medians for each gene.

pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
pred.hesc

## DataFrame with 100 rows and 5 columns
##                                                                     scores
##                                                                   <matrix>
## 1772122_301_C02  0.118426779945786:0.179699807625087:0.157326274226517:...
## 1772122_180_E05  0.129708246318855:0.236277439793527:0.202370888668263:...
## 1772122_300_H02  0.158201338525345:0.250060222727419:0.211831550178353:...
## 1772122_180_B09   0.158778546217777:0.27716592787528:0.222681369744636:...
## 1772122_180_G04   0.138505219642345:0.236658649096383:0.19092437361406:...
## ...                                                                    ...
## 1772122_299_E07  0.145931041885859:0.241153701803065:0.217382763112476:...
## 1772122_180_D02  0.122983434596168:0.239181076829949:0.181221997276501:...
## 1772122_300_D09  0.129757310468164:0.233775092572195:0.196637664917917:...
## 1772122_298_F09  0.143118885460347:0.262267367714562:0.214329641867196:...
## 1772122_302_A11 0.0912854247387272:0.185945405472165:0.139232371863794:...
##                         first.labels                         tuning.scores
##                          <character>                           <DataFrame>
## 1772122_301_C02 Neuroepithelial_cell   0.18244020296249:0.0991115652997192
## 1772122_180_E05 Neuroepithelial_cell  0.137548373236792:0.0647133734667384
## 1772122_300_H02 Neuroepithelial_cell   0.275798157639906:0.136969040146444
## 1772122_180_B09 Neuroepithelial_cell 0.0851622797320583:0.0819878452425098
## 1772122_180_G04 Neuroepithelial_cell   0.198841544187094:0.101662168246495
## ...                              ...                                   ...
## 1772122_299_E07 Neuroepithelial_cell  0.176002520599547:0.0922503823656398
## 1772122_180_D02 Neuroepithelial_cell   0.196760862365318:0.112480486219438
## 1772122_300_D09 Neuroepithelial_cell 0.0816424287822026:0.0221368018363302
## 1772122_298_F09 Neuroepithelial_cell  0.187249853552379:0.0671892835266423
## 1772122_302_A11 Neuroepithelial_cell   0.156079956344163:0.105132159755961
##                               labels        pruned.labels
##                          <character>          <character>
## 1772122_301_C02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05              Neurons              Neurons
## 1772122_300_H02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 Neuroepithelial_cell Neuroepithelial_cell
## ...                              ...                  ...
## 1772122_299_E07 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_D02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_300_D09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_298_F09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_302_A11            Astrocyte            Astrocyte

Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before fine-tuning (first.labels), after fine-tuning (labels) and after pruning (pruned.labels), along with the associated scores. We summarize the distribution of labels across our subset of cells:

table(pred.hesc$labels)

## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                   14                   81                    5

At this point, it is worth noting that SingleR is workflow/package agnostic. The above example uses SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.
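For intuition about the default marker detection mentioned earlier in this section (largest positive log-fold changes in the per-label medians), the following base R sketch applies the same idea to a small simulated log-expression matrix. The data, the planted marker gene1, the helper top.markers() and the choice of three genes per comparison are all hypothetical; SingleR's internal code and defaults may differ.

```r
# Toy log-expression matrix: 20 genes by 9 reference samples, 3 labels.
set.seed(1)
genes <- paste0("gene", 1:20)
labels <- rep(c("A", "B", "C"), each = 3)
expr <- matrix(rnorm(20 * 9), nrow = 20, dimnames = list(genes, NULL))

# Plant a strong marker for label A.
expr["gene1", labels == "A"] <- expr["gene1", labels == "A"] + 10

# Per-label median of each gene (genes in rows, labels in columns).
med <- sapply(split(seq_along(labels), labels), function(i) {
    apply(expr[, i, drop = FALSE], 1, median)
})

# Markers of 'first' relative to 'second': the genes with the largest
# positive difference in medians. On the log scale, this difference is
# a log-fold change.
top.markers <- function(first, second, n = 3) {
    lfc <- med[, first] - med[, second]
    lfc <- sort(lfc[lfc > 0], decreasing = TRUE)
    head(names(lfc), n)
}

top.markers("A", "B")
```

Repeating this for every ordered pair of labels and taking the union of the resulting genes gives the kind of gene set across which the correlations are computed.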

3 Using single-cell references

Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.

library(scRNAseq)
sceM <- MuraroPancreasData()

# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]

sceM <- logNormCounts(sceM)

We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.

sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG) 
sceG <- sceG[,1:100]

We then run SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon rank sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero).

pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)

## 
##      acinar        beta       delta        duct endothelial 
##          53           4           1          41           1
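To see why a median-based comparison can fail on low-coverage data, consider the following toy example with made-up counts for one gene in two hypothetical labels. Both medians are zero, so the difference in medians carries no signal, while the Wilcoxon rank sum test (here via wilcox.test() from base R's stats package) still detects the shift. This only illustrates the motivation and is not how SingleR calls the test internally.

```r
# One gene measured in 20 cells of each of two hypothetical labels.
# It is detected in 8 of 20 cells for label X but only 1 of 20 for label Y.
x <- c(rep(0, 12), rep(2, 8))
y <- c(rep(0, 19), 2)

# Both medians are zero, so a median-based log-fold change sees nothing.
median.diff <- median(x) - median(y)
median.diff

# The rank-based test uses the whole distribution and picks up the difference.
# exact = FALSE requests the normal approximation, as the data contain ties.
p <- wilcox.test(x, y, exact = FALSE)$p.value
p
```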

4 Annotation diagnostics

4.1 Based on the scores within cells

SingleR provides a few basic yet powerful visualization tools. plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. The actual assigned label for each cell is shown in the color bar at the top; note that this may not be the visually top-scoring label if fine-tuning is applied, as only the pre-tuned scores are directly comparable across all labels.

plotScoreHeatmap(pred.grun)

For this plot, the key point is to examine the spread of scores within each cell. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved.
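The same idea can be expressed numerically. The sketch below uses a made-up score matrix (cells in rows, labels in columns) and computes the gap between the best and second-best score for each cell. The "margin" here is an illustrative quantity of our own, not something reported by SingleR().

```r
# Hypothetical scores for two cells across three labels.
scores <- rbind(
    cell1 = c(acinar = 0.60, beta = 0.20, duct = 0.25),  # one clear winner
    cell2 = c(0.41, 0.38, 0.40)                          # similar scores
)

# Gap between the highest and second-highest score within each cell.
margin <- apply(scores, 1, function(s) {
    o <- sort(s, decreasing = TRUE)
    o[1] - o[2]
})
margin
```

A large gap (cell1) corresponds to a column of the heatmap with one obviously dominant score, whereas a small gap (cell2) corresponds to an uncertain assignment.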

We can also display other metadata information for each cell by setting clusters= or annotation_col=. This is occasionally useful for examining potential batch effects, differences in cell type composition between conditions, relationship to clusters from an unsupervised analysis, etc. In the code below, we display which donor each cell comes from:

plotScoreHeatmap(pred.grun, 
    annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))

4.2 Based on the deltas across cells

The pruneScores() function will remove potentially poor-quality or ambiguous assignments. In particular, ambiguous assignments are identified based on the per-cell “delta”, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference. The exact threshold used for pruning is identified using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts.

to.remove <- pruneScores(pred.grun)
summary(to.remove)

##    Mode   FALSE    TRUE 
## logical      96       4
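As a small worked example of the delta defined above, the following sketch computes it by hand on a made-up score matrix. For simplicity it assumes that the assigned label is the top-scoring one, i.e., that fine-tuning did not change the assignment. The variable names are our own.

```r
# Hypothetical scores: cells in rows, labels in columns.
scores <- rbind(
    c(0.60, 0.20, 0.25, 0.22),
    c(0.41, 0.38, 0.40, 0.39)
)
colnames(scores) <- c("acinar", "beta", "delta", "duct")

# Assigned label per cell, taken here as the highest-scoring label.
best <- max.col(scores, ties.method = "first")
assigned <- colnames(scores)[best]

# Score of the assigned label for each cell.
assigned.score <- scores[cbind(seq_len(nrow(scores)), best)]

# Delta: assigned score minus the median score across all labels.
delta.med <- assigned.score - apply(scores, 1, median)
delta.med
```

The first cell has a large delta and would be considered confidently assigned, while the second has a delta close to zero.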

By default, SingleR() will report pruned labels in the pruned.labels field where low-quality assignments are replaced with NA. However, the default pruning thresholds may not be appropriate for every dataset; see ?pruneScores for a more detailed discussion. We provide the plotScoreDistribution() function to help determine whether the thresholds are appropriate by using information across cells with the same label. This displays the per-label distribution of the deltas across cells, from which pruneScores() defines an appropriate threshold as 3 median absolute deviations (MADs) below the median.

plotScoreDistribution(pred.grun, show = "delta.med", ncol = 3, show.nmads = 3)
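The "3 MADs below the median" rule can be sketched as follows for the deltas of cells assigned to a single label. The delta values are made up, and we use base R's mad(), which scales the raw MAD by a constant of 1.4826; whether pruneScores() uses exactly this scaling is an assumption here, and the real function has further options described in ?pruneScores.

```r
# Hypothetical deltas for seven cells assigned to the same label.
deltas <- c(0.30, 0.32, 0.28, 0.31, 0.29, 0.33, 0.05)

# Threshold: a fixed number of MADs below the median delta.
nmads <- 3
threshold <- median(deltas) - nmads * mad(deltas)
threshold

# Cells whose delta falls below the threshold are flagged for pruning.
discard <- deltas < threshold
which(discard)
```

Because the threshold is derived from the distribution of deltas itself, it adapts to labels where correlations are systematically higher or lower.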

If some tuning parameters must be adjusted, we can simply call pruneScores() directly with adjusted parameters. Here, we set labels to NA if they are to be discarded, which is also how SingleR() marks such labels in pruned.labels.

new.pruned <- pred.grun$labels
new.pruned[pruneScores(pred.grun, nmads=5)] <- NA
table(new.pruned, useNA="always")

## new.pruned
##      acinar        beta       delta        duct endothelial        <NA> 
##          53           4           1          41           1           0

4.3 Based on marker gene expression

Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. We extract the identity of the markers from the metadata of the SingleR() results and use them in the plotHeatmap() function from scater, as shown below for beta cell markers. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. At the very least, it should exhibit upregulation of those markers relative to cells assigned to other labels.

all.markers <- metadata(pred.grun)$de.genes
sceG$labels <- pred.grun$labels

# Beta cell-related markers
plotHeatmap(sceG, order_columns_by="labels",
    features=unique(unlist(all.markers$beta)))

We can similarly perform this for all labels by wrapping this code in a loop, as shown below:

for (lab in unique(pred.grun$labels)) {
    plotHeatmap(sceG, order_columns_by=list(I(pred.grun$labels)), 
        features=unique(unlist(all.markers[[lab]]))) 
}

Heatmaps are particularly useful because they allow users to check that the genes are actually biologically meaningful to that cell type’s identity. For example, beta cells would be expected to express insulin, and the fact that they do so gives more confidence to the correctness of the assignment. By comparison, the scores and deltas are more abstract and difficult to interpret for diagnostic purposes. If the identified markers are not meaningful or not consistently upregulated, some skepticism towards the quality of the assignments is warranted.
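A numerical counterpart to the heatmap is to average the expression of each label's markers within each group of assigned cells. The sketch below does this on simulated data. The expression values, the assignments and the marker lists are all invented for illustration (INS being insulin, the other gene symbols chosen as plausible examples), and the layout of the real de.genes metadata is more deeply nested than the flat list used here.

```r
# Toy log-expression matrix: 4 genes by 20 cells with hypothetical assignments.
set.seed(7)
labels <- rep(c("beta", "acinar"), each = 10)
genes <- c("INS", "IAPP", "PRSS1", "CPA1")
expr <- matrix(rnorm(length(genes) * length(labels), mean = 1, sd = 0.2),
    nrow = length(genes), dimnames = list(genes, NULL))

# Hypothetical marker sets; each is upregulated in its own label.
markers <- list(beta = c("INS", "IAPP"), acinar = c("PRSS1", "CPA1"))
for (lab in names(markers)) {
    expr[markers[[lab]], labels == lab] <- expr[markers[[lab]], labels == lab] + 3
}

# Mean expression of each marker set (rows) within each assigned label (columns).
avg <- sapply(unique(labels), function(lab) {
    sapply(markers, function(g) mean(expr[g, labels == lab]))
})
avg
```

If the assignments are sensible, each marker set should have its highest average in the column of its own label.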

5 Available references

The legacy SingleR package provides RDA files that contain normalized expression values and cell type labels based on bulk RNA-seq, microarray and single-cell RNA-seq data from:

Blueprint (Martens and Stunnenberg 2013) and Encode (The ENCODE Project Consortium 2012),
the Human Primary Cell Atlas (Mabbott et al. 2013),
the murine ImmGen (Heng et al. 2008), and
a collection of mouse data sets downloaded from GEO (Benayoun et al. 2019).

The bulk RNA-seq and microarray samples in the first three reference datasets were obtained from pre-sorted cell populations, i.e., the cell labels of these samples were mostly derived from the respective sorting/purification strategy, not via in silico prediction methods.

Three additional reference datasets from bulk RNA-seq and microarray data for immune cells have also been prepared. Each of these datasets was also obtained from pre-sorted cell populations:

The Database for Immune Cell Expression(/eQTLs/Epigenomics) (Schmiedel et al. 2018),
Novershtern Hematopoietic Cell Data - GSE24759 - formerly known as Differentiation Map (Novershtern et al. 2011), and
Monaco Immune Cell Data - GSE107011 (Monaco et al. 2019).

The characteristics of each dataset are summarized below:

Data retrieval	Organism	Samples	Sample types	No. of main labels	No. of fine labels	Cell type focus
`HumanPrimaryCellAtlasData()`	human	713	microarrays of sorted cell populations	37	157	Non-specific
`BlueprintEncodeData()`	human	259	RNA-seq	24	43	Non-specific
`DatabaseImmuneCellExpressionData()`	human	1561	RNA-seq	5	15	Immune
`NovershternHematopoieticData()`	human	211	microarrays of sorted cell populations	17	38	Hematopoietic & Immune
`MonacoImmuneData()`	human	114	RNA-seq	11	29	Immune
`ImmGenData()`	mouse	830	microarrays of sorted cell populations	20	253	Hematopoietic & Immune
`MouseRNAseqData()`	mouse	358	RNA-seq	18	28	Non-specific

Details for each dataset can be viewed on the corresponding help page (e.g. ?ImmGenData). The available sample types in each set are listed in the sections below.

BlueprintEncodeData Labels

label.main	label.fine
Neutrophils	Neutrophils
Monocytes	Monocytes
HSC	MEP
CD4+ T-cells	CD4+ T-cells
CD4+ T-cells	Tregs
CD4+ T-cells	CD4+ Tcm
CD4+ T-cells	CD4+ Tem
CD8+ T-cells	CD8+ Tcm
CD8+ T-cells	CD8+ Tem
NK cells	NK cells
B-cells	naive B-cells
B-cells	Memory B-cells
B-cells	Class-switched memory B-cells
HSC	HSC
HSC	MPP
HSC	CLP
HSC	GMP
Macrophages	Macrophages
CD8+ T-cells	CD8+ T-cells
Erythrocytes	Erythrocytes
HSC	Megakaryocytes
HSC	CMP
Macrophages	Macrophages M1
Macrophages	Macrophages M2
Endothelial cells	Endothelial cells
DC	DC
Eosinophils	Eosinophils
B-cells	Plasma cells
Chondrocytes	Chondrocytes
Fibroblasts	Fibroblasts
Smooth muscle	Smooth muscle
Epithelial cells	Epithelial cells
Melanocytes	Melanocytes
Skeletal muscle	Skeletal muscle
Keratinocytes	Keratinocytes
Endothelial cells	mv Endothelial cells
Myocytes	Myocytes
Adipocytes	Adipocytes
Neurons	Neurons
Pericytes	Pericytes
Adipocytes	Preadipocytes
Adipocytes	Astrocytes
Mesangial cells	Mesangial cells

HumanPrimaryCellAtlasData Labels

label.main	label.fine
DC	DC:monocyte-derived:immature
DC	DC:monocyte-derived:Galectin-1
DC	DC:monocyte-derived:LPS
DC	DC:monocyte-derived
Smooth_muscle_cells	Smooth_muscle_cells:bronchial:vit_D
Smooth_muscle_cells	Smooth_muscle_cells:bronchial
Epithelial_cells	Epithelial_cells:bronchial
B_cell	B_cell
Neutrophils	Neutrophil
T_cells	T_cell:CD8+_Central_memory
T_cells	T_cell:CD8+
T_cells	T_cell:CD4+
T_cells	T_cell:CD8+_effector_memory_RA
T_cells	T_cell:CD8+_effector_memory
T_cells	T_cell:CD8+_naive
Monocyte	Monocyte
Erythroblast	Erythroblast
BM & Prog.	BM
DC	DC:monocyte-derived:rosiglitazone
DC	DC:monocyte-derived:AM580
DC	DC:monocyte-derived:rosiglitazone/AGN193109
DC	DC:monocyte-derived:anti-DC-SIGN_2h
Endothelial_cells	Endothelial_cells:HUVEC
Endothelial_cells	Endothelial_cells:HUVEC:Borrelia_burgdorferi
Endothelial_cells	Endothelial_cells:HUVEC:IFNg
Endothelial_cells	Endothelial_cells:lymphatic
Endothelial_cells	Endothelial_cells:HUVEC:Serum_Amyloid_A
Endothelial_cells	Endothelial_cells:lymphatic:TNFa_48h
T_cells	T_cell:effector
T_cells	T_cell:CCR10+CLA+1,25(OH)2_vit_D3/IL-12
T_cells	T_cell:CCR10-CLA+1,25(OH)2_vit_D3/IL-12
Gametocytes	Gametocytes:spermatocyte
DC	DC:monocyte-derived:A._fumigatus_germ_tubes_6h
Neurons	Neurons:ES_cell-derived_neural_precursor
Keratinocytes	Keratinocytes
Keratinocytes	Keratinocytes:IL19
Keratinocytes	Keratinocytes:IL20
Keratinocytes	Keratinocytes:IL22
Keratinocytes	Keratinocytes:IL24
Keratinocytes	Keratinocytes:IL26
Keratinocytes	Keratinocytes:KGF
Keratinocytes	Keratinocytes:IFNg
Keratinocytes	Keratinocytes:IL1b
HSC_-G-CSF	HSC_-G-CSF
DC	DC:monocyte-derived:mature
Monocyte	Monocyte:anti-FcgRIIB
Macrophage	Macrophage:monocyte-derived:IL-4/cntrl
Macrophage	Macrophage:monocyte-derived:IL-4/Dex/cntrl
Macrophage	Macrophage:monocyte-derived:IL-4/Dex/TGFb
Macrophage	Macrophage:monocyte-derived:IL-4/TGFb
Monocyte	Monocyte:leukotriene_D4
NK_cell	NK_cell
NK_cell	NK_cell:IL2
Embryonic_stem_cells	Embryonic_stem_cells
Tissue_stem_cells	Tissue_stem_cells:iliac_MSC
Chondrocytes	Chondrocytes:MSC-derived
Osteoblasts	Osteoblasts
Tissue_stem_cells	Tissue_stem_cells:BM_MSC
Osteoblasts	Osteoblasts:BMP2
Tissue_stem_cells	Tissue_stem_cells:BM_MSC:BMP2
Tissue_stem_cells	Tissue_stem_cells:BM_MSC:TGFb3
DC	DC:monocyte-derived:Poly(IC)
DC	DC:monocyte-derived:CD40L
DC	DC:monocyte-derived:Schuler_treatment
DC	DC:monocyte-derived:antiCD40/VAF347
Tissue_stem_cells	Tissue_stem_cells:dental_pulp
T_cells	T_cell:CD4+_central_memory
T_cells	T_cell:CD4+_effector_memory
T_cells	T_cell:CD4+_Naive
Smooth_muscle_cells	Smooth_muscle_cells:vascular
Smooth_muscle_cells	Smooth_muscle_cells:vascular:IL-17
BM	BM
Platelets	Platelets
Epithelial_cells	Epithelial_cells:bladder
Macrophage	Macrophage:monocyte-derived
Macrophage	Macrophage:monocyte-derived:M-CSF
Macrophage	Macrophage:monocyte-derived:M-CSF/IFNg
Macrophage	Macrophage:monocyte-derived:M-CSF/Pam3Cys
Macrophage	Macrophage:monocyte-derived:M-CSF/IFNg/Pam3Cys
Macrophage	Macrophage:monocyte-derived:IFNa
Gametocytes	Gametocytes:oocyte
Monocyte	Monocyte:F._tularensis_novicida
Endothelial_cells	Endothelial_cells:HUVEC:B._anthracis_LT
B_cell	B_cell:Germinal_center
B_cell	B_cell:Plasma_cell
B_cell	B_cell:Naive
B_cell	B_cell:Memory
DC	DC:monocyte-derived:AEC-conditioned
Tissue_stem_cells	Tissue_stem_cells:lipoma-derived_MSC
Tissue_stem_cells	Tissue_stem_cells:adipose-derived_MSC_AM3
Endothelial_cells	Endothelial_cells:HUVEC:FPV-infected
Endothelial_cells	Endothelial_cells:HUVEC:PR8-infected
Endothelial_cells	Endothelial_cells:HUVEC:H5N1-infected
Macrophage	Macrophage:monocyte-derived:S._aureus
Fibroblasts	Fibroblasts:foreskin
iPS_cells	iPS_cells:skin_fibroblast-derived
iPS_cells	iPS_cells:skin_fibroblast
T_cells	T_cell:gamma-delta
Monocyte	Monocyte:CD14+
Macrophage	Macrophage:Alveolar
Macrophage	Macrophage:Alveolar:B._anthacis_spores
Neutrophils	Neutrophil:inflam
iPS_cells	iPS_cells:PDB_fibroblasts
iPS_cells	iPS_cells:PDB_1lox-17Puro-5
iPS_cells	iPS_cells:PDB_1lox-17Puro-10
iPS_cells	iPS_cells:PDB_1lox-21Puro-20
iPS_cells	iPS_cells:PDB_1lox-21Puro-26
iPS_cells	iPS_cells:PDB_2lox-5
iPS_cells	iPS_cells:PDB_2lox-22
iPS_cells	iPS_cells:PDB_2lox-21
iPS_cells	iPS_cells:PDB_2lox-17
iPS_cells	iPS_cells:CRL2097_foreskin
iPS_cells	iPS_cells:CRL2097_foreskin-derived:d20_hepatic_diff
iPS_cells	iPS_cells:CRL2097_foreskin-derived:undiff.
B_cell	B_cell:CXCR4+_centroblast
B_cell	B_cell:CXCR4-_centrocyte
Endothelial_cells	Endothelial_cells:HUVEC:VEGF
iPS_cells	iPS_cells:fibroblasts
iPS_cells	iPS_cells:fibroblast-derived:Direct_del._reprog
iPS_cells	iPS_cells:fibroblast-derived:Retroviral_transf
Endothelial_cells	Endothelial_cells:lymphatic:KSHV
Endothelial_cells	Endothelial_cells:blood_vessel
Monocyte	Monocyte:CD16-
Monocyte	Monocyte:CD16+
Tissue_stem_cells	Tissue_stem_cells:BM_MSC:osteogenic
Hepatocytes	Hepatocytes
Neutrophils	Neutrophil:uropathogenic_E._coli_UTI89
Neutrophils	Neutrophil:commensal_E._coli_MG1655
MSC	MSC
Neuroepithelial_cell	Neuroepithelial_cell:ESC-derived
Astrocyte	Astrocyte:Embryonic_stem_cell-derived
Endothelial_cells	Endothelial_cells:HUVEC:IL-1b
HSC_CD34+	HSC_CD34+
CMP	CMP
GMP	GMP
B_cell	B_cell:immature
MEP	MEP
Myelocyte	Myelocyte
Pre-B_cell_CD34-	Pre-B_cell_CD34-
Pro-B_cell_CD34+	Pro-B_cell_CD34+
Pro-Myelocyte	Pro-Myelocyte
Smooth_muscle_cells	Smooth_muscle_cells:umbilical_vein
iPS_cells	iPS_cells:foreskin_fibrobasts
iPS_cells	iPS_cells:iPS:minicircle-derived
iPS_cells	iPS_cells:adipose_stem_cells
iPS_cells	iPS_cells:adipose_stem_cell-derived:lentiviral
iPS_cells	iPS_cells:adipose_stem_cell-derived:minicircle-derived
Fibroblasts	Fibroblasts:breast
Monocyte	Monocyte:MCSF
Monocyte	Monocyte:CXCL4
Neurons	Neurons:adrenal_medulla_cell_line
Tissue_stem_cells	Tissue_stem_cells:CD326-CD56+
NK_cell	NK_cell:CD56hiCD62L+
T_cells	T_cell:Treg:Naive
Neutrophils	Neutrophil:LPS
Neutrophils	Neutrophil:GM-CSF_IFNg
Monocyte	Monocyte:S._typhimurium_flagellin
Neurons	Neurons:Schwann_cell

DatabaseImmuneCellExpressionData Labels

label.main	label.fine
B cells	B cells, naive
Monocytes	Monocytes, CD14+
Monocytes	Monocytes, CD16+
NK cells	NK cells
T cells, CD4+	T cells, CD4+, memory TREG
T cells, CD4+	T cells, CD4+, naive
T cells, CD4+	T cells, CD4+, naive, stimulated
T cells, CD4+	T cells, CD4+, naive TREG
T cells, CD4+	T cells, CD4+, TFH
T cells, CD4+	T cells, CD4+, Th1
T cells, CD4+	T cells, CD4+, Th1_17
T cells, CD4+	T cells, CD4+, Th17
T cells, CD4+	T cells, CD4+, Th2
T cells, CD8+	T cells, CD8+, naive
T cells, CD8+	T cells, CD8+, naive, stimulated

NovershternHematopoieticData Labels

	label.main	label.fine
Basophils	Basophils	Basophils
Naïve B cells	B cells	Naïve B cells
Mature B cells class able to switch	B cells	Mature B cells class able to switch
Mature B cells	B cells	Mature B cells
Mature B cells class switched	B cells	Mature B cells class switched
Common myeloid progenitors	CMPs	Common myeloid progenitors
Plasmacytoid Dendritic Cells	Dendritic cells	Plasmacytoid Dendritic Cells
Myeloid Dendritic Cells	Dendritic cells	Myeloid Dendritic Cells
Eosinophils	Eosinophils	Eosinophils
Erythroid_CD34+ CD71+ GlyA-	Erythroid cells	Erythroid_CD34+ CD71+ GlyA-
Erythroid_CD34- CD71+ GlyA-	Erythroid cells	Erythroid_CD34- CD71+ GlyA-
Erythroid_CD34- CD71+ GlyA+	Erythroid cells	Erythroid_CD34- CD71+ GlyA+
Erythroid_CD34- CD71lo GlyA+	Erythroid cells	Erythroid_CD34- CD71lo GlyA+
Erythroid_CD34- CD71- GlyA+	Erythroid cells	Erythroid_CD34- CD71- GlyA+
Granulocyte/monocyte progenitors	GMPs	Granulocyte/monocyte progenitors
Colony Forming Unit-Granulocytes	Granulocytes	Colony Forming Unit-Granulocytes
Granulocytes (Neutrophilic Metamyelocytes)	Granulocytes	Granulocytes (Neutrophilic Metamyelocytes)
Granulocytes (Neutrophils)	Granulocytes	Granulocytes (Neutrophils)
Hematopoietic stem cells_CD133+ CD34dim	HSCs	Hematopoietic stem cells_CD133+ CD34dim
Hematopoietic stem cells_CD38- CD34+	HSCs	Hematopoietic stem cells_CD38- CD34+
Colony Forming Unit-Megakaryocytic	Megakaryocytes	Colony Forming Unit-Megakaryocytic
Megakaryocytes	Megakaryocytes	Megakaryocytes
Megakaryocyte/erythroid progenitors	MEPs	Megakaryocyte/erythroid progenitors
Colony Forming Unit-Monocytes	Monocytes	Colony Forming Unit-Monocytes
Monocytes	Monocytes	Monocytes
Mature NK cells_CD56- CD16+ CD3-	NK cells	Mature NK cells_CD56- CD16+ CD3-
Mature NK cells_CD56+ CD16+ CD3-	NK cells	Mature NK cells_CD56+ CD16+ CD3-
Mature NK cells_CD56- CD16- CD3-	NK cells	Mature NK cells_CD56- CD16- CD3-
NK T cells	NK T cells	NK T cells
Early B cells	B cells	Early B cells
Pro B cells	B cells	Pro B cells
CD8+ Effector Memory RA	CD8+ T cells	CD8+ Effector Memory RA
Naive CD8+ T cells	CD8+ T cells	Naive CD8+ T cells
CD8+ Effector Memory	CD8+ T cells	CD8+ Effector Memory
CD8+ Central Memory	CD8+ T cells	CD8+ Central Memory
Naive CD4+ T cells	CD4+ T cells	Naive CD4+ T cells
CD4+ Effector Memory	CD4+ T cells	CD4+ Effector Memory
CD4+ Central Memory	CD4+ T cells	CD4+ Central Memory

MonacoImmuneData Labels

	label.main	label.fine
Naive CD8 T cells	CD8+ T cells	Naive CD8 T cells
Central memory CD8 T cells	CD8+ T cells	Central memory CD8 T cells
Effector memory CD8 T cells	CD8+ T cells	Effector memory CD8 T cells
Terminal effector CD8 T cells	CD8+ T cells	Terminal effector CD8 T cells
MAIT cells	T cells	MAIT cells
Vd2 gd T cells	T cells	Vd2 gd T cells
Non-Vd2 gd T cells	T cells	Non-Vd2 gd T cells
Follicular helper T cells	CD4+ T cells	Follicular helper T cells
T regulatory cells	CD4+ T cells	T regulatory cells
Th1 cells	CD4+ T cells	Th1 cells
Th1/Th17 cells	CD4+ T cells	Th1/Th17 cells
Th17 cells	CD4+ T cells	Th17 cells
Th2 cells	CD4+ T cells	Th2 cells
Naive CD4 T cells	CD4+ T cells	Naive CD4 T cells
Progenitor cells	Progenitors	Progenitor cells
Naive B cells	B cells	Naive B cells
Non-switched memory B cells	B cells	Non-switched memory B cells
Exhausted B cells	B cells	Exhausted B cells
Switched memory B cells	B cells	Switched memory B cells
Plasmablasts	B cells	Plasmablasts
Classical monocytes	Monocytes	Classical monocytes
Intermediate monocytes	Monocytes	Intermediate monocytes
Non classical monocytes	Monocytes	Non classical monocytes
Natural killer cells	NK cells	Natural killer cells
Plasmacytoid dendritic cells	Dendritic cells	Plasmacytoid dendritic cells
Myeloid dendritic cells	Dendritic cells	Myeloid dendritic cells
Low-density neutrophils	Neutrophils	Low-density neutrophils
Low-density basophils	Basophils	Low-density basophils
Terminal effector CD4 T cells	CD4+ T cells	Terminal effector CD4 T cells

ImmGenData Labels

label.main	label.fine
Macrophages	Macrophages (MF.11C-11B+)
Macrophages	Macrophages (MF.ALV)
Monocytes	Monocytes (MO.6+I-)
Monocytes	Monocytes (MO.6+2+)
B cells	B cells (B.MEM)
B cells	B cells (B1A)
DC	DC (DC.11B+)
DC	DC (DC.11B-)
Stromal cells	Stromal cells (DN.CFA)
Stromal cells	Stromal cells (DN)
Eosinophils	Eosinophils (EO)
Fibroblasts	Fibroblasts (FRC.CAD11.WT)
Fibroblasts	Fibroblasts (FRC.CFA)
Fibroblasts	Fibroblasts (FRC)
Neutrophils	Neutrophils (GN)
Endothelial cells	Endothelial cells (LEC.CFA)
Endothelial cells	Endothelial cells (LEC)
Macrophages	Macrophages (MF)
T cells	T cells (T.DP.69-)
T cells	T cells (T.DP)
T cells	T cells (T.DP69+)
Macrophages	Macrophages (MF.F480HI.GATA6KO)
Macrophages	Macrophages (MF.F480HI.CTRL)
T cells	T cells (T.CD4.1H)
T cells	T cells (T.CD4.24H)
T cells	T cells (T.CD4.48H)
T cells	T cells (T.CD4.5H)
T cells	T cells (T.CD4.96H)
T cells	T cells (T.CD4.CTR)
T cells	T cells (T.CD8.1H)
T cells	T cells (T.CD8.24H)
T cells	T cells (T.CD8.48H)
T cells	T cells (T.CD8.5H)
T cells	T cells (T.CD8.96H)
T cells	T cells (T.CD8.CTR)
Macrophages	Macrophages (MFAR-)
Monocytes	Monocytes (MO)
ILC	ILC (ILC1.CD127+)
ILC	ILC (LIV.ILC1.DX5-)
ILC	ILC (LPL.NCR+ILC1)
ILC	ILC (ILC2)
ILC	ILC (LPL.NCR+ILC3)
ILC	ILC (ILC3.LTI.CD4+)
ILC	ILC (ILC3.LTI.CD4-)
ILC	ILC (ILC3.LTI.4+)
NK cells	NK cells (NK.CD127-)
ILC	ILC (LIV.NK.DX5+)
ILC	ILC (LPL.NCR+CNK)
Basophils	Basophils (BA)
Epithelial cells	Epithelial cells (Ep.5wk.MEC.Sca1+)
Epithelial cells	Epithelial cells (Ep.5wk.MEChi)
Epithelial cells	Epithelial cells (Ep.5wk.MEClo)
Epithelial cells	Epithelial cells (Ep.8wk.CEC.Sca1+)
Epithelial cells	Epithelial cells (Ep.8wk.CEChi)
Epithelial cells	Epithelial cells (Ep.8wk.MEChi)
Epithelial cells	Epithelial cells (Ep.8wk.MEClo)
Mast cells	Mast cells (MC.ES)
Mast cells	Mast cells (MC)
Mast cells	Mast cells (MC.TO)
Mast cells	Mast cells (MC.TR)
Mast cells	Mast cells (MC.DIGEST)
Epithelial cells	Epithelial cells (MECHI.GFP+.ADULT)
Epithelial cells	Epithelial cells (MECHI.GFP+.ADULT.KO)
Epithelial cells	Epithelial cells (MECHI.GFP-.ADULT)
Macrophages	Macrophages (MF.480HI.NAIVE)
Macrophages	Macrophages (MF.480INT.NAIVE)
T cells	T cells (T.4EFF49D+11A+.D8.LCMV)
T cells	T cells (T.4MEM49D+11A+.D30.LCMV)
T cells	T cells (T.4NVE44-49D-11A-)
T cells	T cells (T.8EFF.TBET+.OT1LISOVA)
T cells	T cells (T.8EFF.TBET-.OT1LISOVA)
T cells	T cells (T.8EFFKLRG1+CD127-.D8.LISOVA)
T cells	T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)
T cells	T cells (T.4+8int)
T cells	T cells (T.4FP3+25+)
T cells	T cells (T.4int8+)
T cells	T cells (T.4SP24-)
T cells	T cells (T.4SP24int)
T cells	T cells (T.4SP69+)
T cells	T cells (T.8SP24-)
T cells	T cells (T.8SP24int)
T cells	T cells (T.8SP69+)
T cells	T cells (T.DPbl)
T cells	T cells (T.DPsm)
T cells	T cells (T.ISP)
B cells	B cells (B.FrE)
B cells	B cells (B.FrF)
B cells	B cells (preB.FrD)
B cells	B cells (proB.FrBC)
B cells	B cells (preB.FrC)
Stem cells	Stem cells (SC.STSL)
T cells	T cells (T.CD4+TESTNA)
T cells	T cells (T.CD4+TESTDB)
B cells	B cells (B.CD19CONTROL)
T cells	T cells (T.CD4CONTROL)
T cells	T cells (T.CD4TESTJS)
T cells	T cells (T.CD4TESTCJ)
Stem cells	Stem cells (SC.CD150-CD48-)
Tgd	Tgd (Tgd.imm.vg2+)
Tgd	Tgd (Tgd.imm.vg2)
Tgd	Tgd (Tgd.mat.vg3)
Tgd	Tgd (Tgd.mat.vg3.)
Tgd	Tgd (Tgd)
Tgd	Tgd (Tgd.vg2+.act)
Tgd	Tgd (Tgd.vg2-.act)
Tgd	Tgd (Tgd.vg2-)
B cells	B cells (B.Fo)
B cells	B cells (B.FRE)
B cells	B cells (B.GC)
B cells	B cells (B.MZ)
B cells	B cells (B.T1)
B cells	B cells (B.T2)
B cells	B cells (B.T3)
B cells	B cells (B1a)
B cells	B cells (B1b)
DC	DC (DC)
DC	DC (DC.103+11B-)
DC	DC (DC.8-4-11B+)
DC	DC (DC.LC)
NK cells	NK cells (NK.49CI+)
NK cells	NK cells (NK.49CI-)
NK cells	NK cells (NK.B2M-)
NK cells	NK cells (NK.DAP10-)
NK cells	NK cells (NK.DAP12-)
NK cells	NK cells (NK.H+.MCMV1)
NK cells	NK cells (NK.H+.MCMV7)
NK cells	NK cells (NK.H+MCMV1)
NK cells	NK cells (NK.MCMV7)
NK cells	NK cells (NK)
NKT	NKT (NKT.4+)
NKT	NKT (NKT.4-)
NKT	NKT (NKT.44+NK1.1+)
NKT	NKT (NKT.44+NK1.1-)
NKT	NKT (NKT.44-NK1.1-)
B cells	B cells (preB.FRD)
B cells	B cells (proB.CLP)
Stem cells	Stem cells (proB.CLP)
B cells	B cells (proB.FrA)
B cells	B cells (proB.FRA)
B cells, pro	B cells, pro (proB.FrA)
T cells	T cells (T.4MEM)
T cells	T cells (T.4Mem)
T cells	T cells (T.4MEM44H62L)
T cells	T cells (T.4Nve)
T cells	T cells (T.4NVE)
T cells	T cells (T.8EFF.OT1.D15.VSVOVA)
T cells	T cells (T.8EFF.OT1.D5.VSVOVA)
T cells	T cells (T.8EFF.OT1.VSVOVA)
T cells	T cells (T.8EFF.OT1.D8.VSVOVA)
T cells	T cells (T.8MEM)
T cells	T cells (T.8Mem)
T cells	T cells (T.8MEM.OT1.D106.VSVOVA)
T cells	T cells (T.8EFF.OT1.D45VSV)
T cells	T cells (T.8Nve)
T cells	T cells (T.8NVE)
B cells	B cells (proB.FRBC)
T cells	T cells (T.4)
T cells	T cells (T.4.Pa)
T cells	T cells (T.4.PLN)
T cells	T cells (T.4FP3-)
Tgd	Tgd (Tgd.VG2+)
Tgd	Tgd (Tgd.vg2+.TCRbko)
Tgd	Tgd (Tgd.vg2-.TCRbko)
Tgd	Tgd (Tgd.vg5+.act)
Tgd	Tgd (Tgd.VG5+.ACT)
Tgd	Tgd (Tgd.VG5+)
Tgd	Tgd (Tgd.vg5-.act)
Tgd	Tgd (Tgd.VG5-)
NK cells	NK cells (NK.49H+)
NK cells	NK cells (NK.49H-)
DC	DC (DC.8+)
DC	DC (DC.8-)
DC	DC (DC.8-4-11B-)
DC	DC (DC.PDC.8+)
DC	DC (DC.PDC.8-)
Macrophages	Macrophages (MF.II-480HI)
Macrophages	Macrophages (MF.RP)
Macrophages	Macrophages (MFIO5.II+480INT)
Macrophages	Macrophages (MFIO5.II+480LO)
Macrophages	Macrophages (MFIO5.II-480HI)
Macrophages	Macrophages (MFIO5.II-480INT)
Monocytes	Monocytes (MO.6C+II+)
Monocytes	Monocytes (MO.6C+II-)
Monocytes	Monocytes (MO.6C-II+)
Monocytes	Monocytes (MO.6C-II-)
Monocytes	Monocytes (MO.6C-IIINT)
T cells	T cells (T.8EFF.OT1.D10LIS)
T cells	T cells (T.8EFF.OT1.D10.LISOVA)
T cells	T cells (T.8EFF.OT1.D15LIS)
T cells	T cells (T.8EFF.OT1.D15.LISOVA)
T cells	T cells (T.8EFF.OT1LISO)
T cells	T cells (T.8EFF.OT1.LISOVA)
T cells	T cells (T.8EFF.OT1.D8LISO)
T cells	T cells (T.8EFF.OT1.D8.LISOVA)
T cells	T cells (T.8MEM.OT1.D100.LISOVA)
T cells	T cells (T.8MEM.OT1.D45.LISOVA)
T cells	T cells (T.8NVE.OT1)
B cells	B cells (B.FO)
Endothelial cells	Endothelial cells (BEC)
Epithelial cells	Epithelial cells (EP.MECHI)
Fibroblasts	Fibroblasts (FI.MTS15+)
Fibroblasts	Fibroblasts (FI)
Stromal cells	Stromal cells (ST.31-38-44-)
Stem cells	Stem cells (SC.LT34F)
Stem cells	Stem cells (SC.MDP)
Stem cells	Stem cells (SC.MEP)
Stem cells	Stem cells (SC.MPP34F)
Stem cells	Stem cells (SC.ST34F)
Stem cells	Stem cells (SC.CDP)
Stem cells	Stem cells (SC.CMP.DR)
Stem cells	Stem cells (GMP)
Stem cells	Stem cells (MLP)
Stem cells	Stem cells (LTHSC)
T cells	T cells (T.DN2-3)
T cells	T cells (T.DN2)
T cells	T cells (T.DN2A)
T cells	T cells (T.DN2B)
T cells	T cells (T.DN3-4)
T cells	T cells (T.DN3A)
T cells	T cells (T.DN3B)
T cells	T cells (T.DN1-2)
T cells	T cells (T.DN4)
Macrophages	Macrophages (MF.103-11B+.SALM3)
Macrophages	Macrophages (MF.103-11B+)
DC	DC (DC.103-11B+24+)
Macrophages	Macrophages (MF.103-11B+24-)
DC	DC (DC.103-11B+F4-80LO.KD)
Macrophages	Macrophages (MF.11CLOSER.SALM3)
Macrophages	Macrophages (MF.11CLOSER)
Macrophages	Macrophages (MF.103CLOSER)
Macrophages	Macrophages (MF.II+480LO)
Neutrophils	Neutrophils (GN.ARTH)
Neutrophils	Neutrophils (GN.Thio)
Neutrophils	Neutrophils (GN.URAC)
Macrophages	Macrophages (MF.169+11CHI)
Macrophages	Macrophages (MF.MEDL)
Macrophages	Macrophages (MF.SBCAPS)
Microglia	Microglia (Microglia)
T cells	T cells (T.ETP)
Tgd	Tgd (Tgd.imm.VG1+)
Tgd	Tgd (Tgd.imm.VG1+VD6+)
Tgd	Tgd (Tgd.mat.VG1+)
Tgd	Tgd (Tgd.mat.VG1+VD6+)
Tgd	Tgd (Tgd.mat.VG2+)
Tgd	Tgd (Tgd.VG3+24AHI)
Tgd	Tgd (Tgd.VG5+24AHI)
T cells	T cells (T.8EFF.OT1.12HR.LISOVA)
T cells	T cells (T.8EFF.OT1.24HR.LISOVA)
T cells	T cells (T.8EFF.OT1.48HR.LISOVA)
T cells	T cells (T.Tregs)
Tgd	Tgd (Tgd.VG2+24AHI)
Tgd	Tgd (Tgd.VG4+24AHI)
Tgd	Tgd (Tgd.VG4+24ALO)

MouseRNAseqData Labels

label.main	label.fine
Adipocytes	Adipocytes
Neurons	aNSCs
Astrocytes	Astrocytes
Astrocytes	Astrocytes activated
Endothelial cells	Endothelial cells
Erythrocytes	Erythrocytes
Fibroblasts	Fibroblasts
Fibroblasts	Fibroblasts activated
Fibroblasts	Fibroblasts senescent
Granulocytes	Granulocytes
Macrophages	Macrophages
Microglia	Microglia
Microglia	Microglia activated
Monocytes	Monocytes
Neurons	Neurons
Neurons	Neurons activated
NK cells	NK cells
Neurons	NPCs
Oligodendrocytes	Oligodendrocytes
Neurons	qNSCs
T cells	T cells
Dendritic cells	Dendritic cells
Cardiomyocytes	Cardiomyocytes
Hepatocytes	Hepatocytes
B cells	B cells
Epithelial cells	Ependymal
Oligodendrocytes	OPCs
Macrophages	Macrophages activated

6 Reference options

6.1 Pseudo-bulk aggregation

Single-cell reference datasets provide a like-for-like comparison to our test datasets, which should in principle yield a more accurate classification of the cells in the latter. However, single-cell references frequently contain many more samples than bulk references, increasing the computational work involved in classification. We avoid this by aggregating cells into one “pseudo-bulk” sample per label (e.g., by averaging across log-expression values) and using those as the reference, which allows us to achieve the same efficiency as the use of bulk references.

The obvious cost of this approach is that we discard potentially useful information about the distribution of cells within each label. Cells that belong to a heterogeneous population may not be correctly assigned if they are far from the population center. We attempt to preserve some of this information by using \(k\)-means clustering within each label to create pseudo-bulk samples that are representative of a particular region of the expression space (i.e., vector quantization). We create \(\sqrt{N}\) clusters given a label with \(N\) cells, which provides a reasonable compromise between reducing computational work and preserving the label’s internal distribution.

This aggregation approach is implemented in the aggregateReference() function, which is shown in action below for the Muraro et al. (2016) dataset. The function returns a SummarizedExperiment object containing the pseudo-bulk expression profiles and the corresponding labels.

set.seed(100) # for the k-means step.
aggr <- aggregateReference(sceM, labels=sceM$label)
aggr

## class: SummarizedExperiment 
## dim: 19059 116 
## metadata(0):
## assays(1): logcounts
## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(0):
## colnames(116): alpha.1 alpha.2 ... mesenchymal.8 epsilon.1
## colData names(1): label

The resulting SummarizedExperiment can then be used as a reference in SingleR().

pred.aggr <- SingleR(sceG, aggr, labels=aggr$label)
table(pred.aggr$labels)

## 
## acinar   beta  delta   duct 
##     52      4      1     43
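
To make the \(k\)-means aggregation described above more concrete, the following base R sketch applies it to a simulated log-expression matrix. It is an illustration only and not the aggregateReference() implementation: aggregate_by_kmeans() is a hypothetical helper, the data and label names are made up, and rounding \(\sqrt{N}\) to the nearest integer is our own assumption.

```r
# Simulated log-expression matrix: 50 genes x 60 cells with two made-up labels.
set.seed(100)  # k-means uses random starting centers
expr <- matrix(rnorm(50 * 60), nrow = 50,
    dimnames = list(paste0("gene", 1:50), paste0("cell", 1:60)))
labels <- rep(c("alpha", "beta"), c(36, 24))

# Hypothetical helper (not part of SingleR): one set of pseudo-bulk
# profiles per label, using roughly sqrt(N) k-means clusters.
aggregate_by_kmeans <- function(expr, labels) {
    profiles <- list()
    for (lab in unique(labels)) {
        current <- expr[, labels == lab, drop = FALSE]
        # Assumption: round sqrt(N) and keep at least one cluster.
        k <- max(1, round(sqrt(ncol(current))))
        # kmeans() clusters rows, so cells are transposed into rows.
        km <- kmeans(t(current), centers = k)
        # Each cluster center is the mean profile of its cells.
        centers <- t(km$centers)
        colnames(centers) <- paste0(lab, ".", seq_len(k))
        profiles[[lab]] <- centers
    }
    out <- do.call(cbind, unname(profiles))
    list(profiles = out, labels = sub("\\.[0-9]+$", "", colnames(out)))
}

aggr.toy <- aggregate_by_kmeans(expr, labels)
dim(aggr.toy$profiles)
table(aggr.toy$labels)
```

With 36 and 24 cells, this sketch produces 6 and 5 pseudo-bulk profiles respectively, so the reference shrinks from 60 columns to 11 while still covering several regions of each label's expression space.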

6.2 Using multiple references

In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types than any individual reference can provide, especially when differences in resolution are also considered. Use of multiple references is supported by simply passing multiple objects to the ref= and labels= arguments in SingleR(). We demonstrate below by including another reference (from Blueprint-Encode) in our annotation of the La Manno et al. (2016) dataset:

bp.se <- BlueprintEncodeData()

pred.combined <- SingleR(test = hESCs, 
    ref = list(BP=bp.se, HPCA=hpca.se), 
    labels = list(bp.se$label.main, hpca.se$label.main))

The output takes the same form as previously described, and we can easily access the combined set of labels:

table(pred.combined$labels)

## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                    4                   63                   33

Our strategy is to perform annotation on each reference separately and then take the highest-scoring label across references. This provides a light-weight approach to combining information from multiple references while avoiding batch effects and the need for up-front harmonization. (Of course, the main practical difficulty of this approach is that the same cell type may have different labels across references, which will require some implicit harmonization during interpretation.) Further comments on the justification behind the choice of this method can be found at ?combineResults.
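
The "highest-scoring label across references" rule can be sketched in base R as follows. This is a simplified illustration under our own assumptions, not the code behind combineResults(): the score matrices are invented, combine_by_max() is a hypothetical helper, and we assume that each reference contributes a cell-by-label matrix of scores.

```r
# Made-up per-reference scores (cells in rows, labels in columns).
scores.bp <- matrix(c(0.42, 0.31,
                      0.20, 0.55,
                      0.35, 0.30),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("cell", 1:3), c("Neurons", "Astrocytes")))
scores.hpca <- matrix(c(0.40, 0.47,
                        0.25, 0.50,
                        0.10, 0.33),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("cell", 1:3), c("Neurons", "Neuroepithelial_cell")))

# Hypothetical helper: pick the best label within each reference,
# then keep the reference whose best label has the highest score.
combine_by_max <- function(score.list) {
    best.label <- do.call(cbind, lapply(score.list, function(s) {
        colnames(s)[max.col(s, ties.method = "first")]
    }))
    best.score <- do.call(cbind, lapply(score.list, function(s) {
        apply(s, 1, max)
    }))
    winner <- max.col(best.score, ties.method = "first")
    idx <- cbind(seq_len(nrow(best.score)), winner)
    data.frame(
        labels = best.label[idx],
        reference = colnames(best.score)[winner],
        score = best.score[idx],
        row.names = rownames(best.score),
        stringsAsFactors = FALSE
    )
}

combined <- combine_by_max(list(BP = scores.bp, HPCA = scores.hpca))
combined
```

Note that each reference is scored independently in this sketch, so no expression values are ever compared between references, which is why batch effects between them do not need to be removed beforehand.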

6.3 Harmonizing labels

The matchReferences() function provides a simple yet elegant approach for label harmonization between two references. Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed. Probabilities close to 1 indicate there is a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.

matched <- matchReferences(bp.se, hpca.se,
    bp.se$label.main, hpca.se$label.main)
pheatmap::pheatmap(matched, col=viridis::plasma(100))

A heatmap like the one above can be used to guide harmonization to enforce a consistent vocabulary across all labels representing the same cell type or state. The most obvious benefit of harmonization is that interpretation of the results is simplified. However, an even more important effect is that the presence of harmonized labels from multiple references allows the classification machinery to protect against irrelevant batch effects between references. For example, in SingleR()’s case, marker genes are favored if they are consistently upregulated across multiple references, improving robustness to technical idiosyncrasies in any test dataset.

We stress that some manual intervention is still required in this process, given the risks posed by differences in biological systems and technologies. For example, neurons are considered unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4⁺ and CD8⁺ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here.

As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This is an “off-label” use that involves substituting the cluster assignments as proxies for the labels. We can then match up clusters and integrate conclusions from multiple datasets without the difficulty of batch correction and reclustering.
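
For intuition about what a "probability of mutual assignment" might look like, the base R sketch below computes one from toy cross-annotation results. We assume here that it is the product of the two conditional assignment proportions (reference 1 label assigned to reference 2 label, and vice versa); this is our reading of the description above and may differ from the details inside matchReferences(). The mutual_prob() helper and all labels are hypothetical.

```r
# Made-up cross-annotation results.
# Samples of reference 1, with the reference 2 labels assigned to them:
ref1.labels   <- c("T cells", "T cells", "B cells", "B cells", "Neurons")
ref1.assigned <- c("T_cell",  "T_cell",  "B_cell",  "T_cell",  "B_cell")
# Samples of reference 2, with the reference 1 labels assigned to them:
ref2.labels   <- c("T_cell",  "T_cell",  "B_cell",  "B_cell")
ref2.assigned <- c("T cells", "T cells", "B cells", "B cells")

# Hypothetical helper: product of the two conditional proportions.
mutual_prob <- function(labels1, assigned1, labels2, assigned2) {
    u1 <- unique(labels1)
    u2 <- unique(labels2)
    # Proportion of each reference 1 label assigned to each reference 2 label.
    p12 <- prop.table(table(factor(labels1, u1), factor(assigned1, u2)), 1)
    # Proportion of each reference 2 label assigned to each reference 1 label.
    p21 <- prop.table(table(factor(labels2, u2), factor(assigned2, u1)), 1)
    out <- unclass(p12) * t(unclass(p21))
    names(dimnames(out)) <- NULL
    out
}

mutual <- mutual_prob(ref1.labels, ref1.assigned, ref2.labels, ref2.assigned)
mutual
```

In this toy case the "T cells"/"T_cell" pair reaches a value of 1, the "B cells"/"B_cell" pair is penalized because one B cell was assigned elsewhere, and the all-zero "Neurons" row marks a label with no counterpart in the second reference.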

7 Advanced use

7.1 Improving efficiency

Advanced users can split the SingleR() workflow into two separate training and classification steps. This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once. The resulting data structures can then be re-used across multiple classifications with different test datasets, provided the test feature set is identical to or a superset of the features in the training set. For example:

common <- intersect(rownames(hESCs), rownames(hpca.se))
trained <- trainSingleR(hpca.se[common,], labels=hpca.se$label.main)
pred.hesc2 <- classifySingleR(hESCs[common,], trained)
table(pred.hesc$labels, pred.hesc2$labels)

##                       
##                        Astrocyte Neuroepithelial_cell Neurons
##   Astrocyte                   14                    0       0
##   Neuroepithelial_cell         0                   81       0
##   Neurons                      0                    0       5

Other efficiency improvements are possible through several arguments:

Switching to an approximate algorithm for the nearest neighbor search in trainSingleR() via the BNPARAM= argument from the BiocNeighbors package.
Parallelizing the fine-tuning step in classifySingleR() with the BPPARAM= argument from the BiocParallel package.

These arguments can also be specified in the SingleR() command.

7.2 Defining custom markers

Users can also construct their own marker lists with any DE testing machinery. For example, we can perform pairwise \(t\)-tests using methods from scran and obtain the top 10 marker genes from each pairwise comparison.

library(scran)
out <- pairwiseTTests(logcounts(sceM), sceM$label, direction="up")
markers <- getTopMarkers(out$statistics, out$pairs, n=10)

We then supply these genes to SingleR() directly via the genes= argument. A more focused gene set also allows annotation to be performed more quickly compared to the default approach.

pred.grun2 <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=markers)
table(pred.grun2$labels)

## 
##  acinar    beta   delta    duct      pp unclear 
##      59       4       1      34       1       1

In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels. This is accommodated by supplying a named list of character vectors (one vector per label) to genes=. Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded.

label.markers <- lapply(markers, unlist, recursive=FALSE)
pred.grun3 <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=label.markers)
table(pred.grun$labels, pred.grun3$labels)

##              
##               acinar beta delta duct pp
##   acinar          51    0     0    2  0
##   beta             0    4     0    0  0
##   delta            0    0     1    0  0
##   duct             2    0     0   39  0
##   endothelial      0    0     0    0  1
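
The difference between the two marker formats is easier to see on a small hand-written example. The sketch below uses plain base R lists with illustrative gene names that we have chosen ourselves (they are not results from the analysis above), and collapses a list of lists into one vector per label; the use of unique() to drop repeated genes is our own addition.

```r
# Pairwise format: markers.toy[[A]][[B]] holds genes upregulated in A over B.
markers.toy <- list(
    alpha = list(beta = c("GCG", "TTR"),   delta = c("GCG", "IRX2")),
    beta  = list(alpha = c("INS", "IAPP"), delta = c("INS", "MAFA")),
    delta = list(alpha = c("SST", "RBP4"), beta = c("SST", "LEPR"))
)

# Per-label format: pool the genes from all comparisons involving a label.
# The information about which comparison a gene came from is lost here.
label.markers.toy <- lapply(markers.toy, function(x) {
    unique(unlist(x, use.names = FALSE))
})
label.markers.toy
```

Each label ends up with a single character vector, so a gene that only distinguishes "alpha" from "delta" is treated the same as one that distinguishes "alpha" from everything else.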

8 FAQs

How can I use this with my Seurat, SingleCellExperiment, or cell_data_set object?

SingleR is workflow-agnostic: all it needs is normalized counts. An example showing how to map its results back to common single-cell data objects is available in the README.

Where can I find reference sets appropriate for my data?

The scRNAseq package contains many single-cell datasets, with more continually being added. The ArrayExpress and GEOquery packages can be used to download any of the bulk or single-cell datasets in ArrayExpress or GEO, respectively.

9 Session information

sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.14.6                knitr_1.28                 
##  [3] scater_1.14.6               ggplot2_3.3.0              
##  [5] scRNAseq_2.0.2              SingleCellExperiment_1.8.0 
##  [7] SingleR_1.0.6               SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.2         BiocParallel_1.20.1        
## [11] matrixStats_0.56.0          Biobase_2.46.0             
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [15] IRanges_2.20.2              S4Vectors_0.24.3           
## [17] BiocGenerics_0.32.0         BiocStyle_2.14.4           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] RColorBrewer_1.1-2            httr_1.4.1                   
##  [5] tools_3.6.3                   R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.1.0                     colorspace_1.4-1             
## [11] withr_2.1.2                   gridExtra_2.3                
## [13] tidyselect_1.0.0              bit_1.1-15.2                 
## [15] curl_4.3                      compiler_3.6.3               
## [17] cli_2.0.2                     BiocNeighbors_1.4.2          
## [19] labeling_0.3                  bookdown_0.18                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.25                
## [25] rmarkdown_2.1                 XVector_0.26.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] highr_0.8                     limma_3.42.2                 
## [31] dbplyr_1.4.2                  fastmap_1.0.1                
## [33] rlang_0.4.5                   RSQLite_2.2.0                
## [35] shiny_1.4.0.2                 DelayedMatrixStats_1.8.0     
## [37] farver_2.0.3                  dplyr_0.8.5                  
## [39] RCurl_1.98-1.1                magrittr_1.5                 
## [41] BiocSingular_1.2.2            GenomeInfoDbData_1.2.2       
## [43] Matrix_1.2-18                 Rcpp_1.0.4                   
## [45] ggbeeswarm_0.6.0              munsell_0.5.0                
## [47] fansi_0.4.1                   viridis_0.5.1                
## [49] lifecycle_0.2.0               edgeR_3.28.1                 
## [51] stringi_1.4.6                 yaml_2.2.1                   
## [53] zlibbioc_1.32.0               BiocFileCache_1.10.2         
## [55] AnnotationHub_2.18.0          grid_3.6.3                   
## [57] blob_1.2.1                    dqrng_0.2.1                  
## [59] promises_1.1.0                ExperimentHub_1.12.0         
## [61] crayon_1.3.4                  lattice_0.20-41              
## [63] magick_2.3                    locfit_1.5-9.4               
## [65] pillar_1.4.3                  igraph_1.2.5                 
## [67] glue_1.4.0                    BiocVersion_3.10.1           
## [69] evaluate_0.14                 BiocManager_1.30.10          
## [71] vctrs_0.2.4                   httpuv_1.5.2                 
## [73] gtable_0.3.0                  purrr_0.3.3                  
## [75] assertthat_0.2.1              xfun_0.12                    
## [77] rsvd_1.0.3                    mime_0.9                     
## [79] xtable_1.8-4                  later_1.0.0                  
## [81] viridisLite_0.3.0             pheatmap_1.0.12              
## [83] tibble_3.0.0                  AnnotationDbi_1.48.0         
## [85] beeswarm_0.2.3                memoise_1.1.0                
## [87] statmod_1.4.34                ellipsis_0.3.0               
## [89] interactiveDisplayBase_1.24.0

References

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2):163–72.

Benayoun, Bérénice A., Elizabeth A. Pollina, Param Priya Singh, Salah Mahmoudi, Itamar Harel, Kerriann M. Casey, Ben W. Dulken, Anshul Kundaje, and Anne Brunet. 2019. “Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses.” Genome Research 29:697–709. https://doi.org/10.1101/gr.240093.118.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Heng, Tracy S.P., Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, et al. 2008. “The immunological genome project: Networks of gene expression in immune cells.” Nature Immunology 9 (10):1091–4. https://doi.org/10.1038/ni1008-1091.

La Manno, G., D. Gyllborg, S. Codeluppi, K. Nishimura, C. Salto, A. Zeisel, L. E. Borm, et al. 2016. “Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells.” Cell 167 (2):566–80.

Mabbott, Neil A., J. K. Baillie, Helen Brown, Tom C. Freeman, and David A. Hume. 2013. “An expression atlas of human primary cells: Inference of gene function from coexpression networks.” BMC Genomics 14. https://doi.org/10.1186/1471-2164-14-632.

Martens, Joost H A, and Hendrik G. Stunnenberg. 2013. “BLUEPRINT: Mapping human blood cell epigenomes.” Haematologica 98:1487–9. https://doi.org/10.3324/haematol.2013.094243.

Monaco, Gianni, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, et al. 2019. “RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute Deconvolution of Human Immune Cell Types.” Cell Reports 26 (6):1627–1640.e7. https://doi.org/10.1016/j.celrep.2019.01.041.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

Novershtern, Noa, Aravind Subramanian, Lee N. Lawton, Raymond H. Mak, W. Nicholas Haining, Marie E. McConkey, Naomi Habib, et al. 2011. “Densely Interconnected Transcriptional Circuits Control Cell States in Human Hematopoiesis.” Cell 144 (2):296–309. https://doi.org/10.1016/j.cell.2011.01.004.

Schmiedel, Benjamin J., Divya Singh, Ariel Madrigal, Alan G. Valdovino-Gonzalez, Brandie M. White, Jose Zapardiel-Gonzalo, Brendan Ha, et al. 2018. “Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression.” Cell 175 (6):1701–1715.e16. https://doi.org/10.1016/j.cell.2018.10.022.

The ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature. https://doi.org/10.1038/nature11247.

Using SingleR to annotate single-cell RNA-seq data

Revised: December 18th, 2019

Package