1 Introduction

SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference set. Specifically, for each test cell:

  1. We compute the Spearman correlation between its expression profile and that of each reference sample. This is done across the union of marker genes identified between all pairs of labels.
  2. We define the per-label score as a fixed quantile (by default, 0.8) of the distribution of correlations across all reference samples with that label.
  3. We repeat this for all labels and we take the label with the highest score as the annotation for this cell.
  4. We optionally perform a fine-tuning step:
     • The reference dataset is subsetted to only include labels with scores close to the maximum.
     • Scores are recomputed using only marker genes for the subset of labels.
     • This is iterated until one label remains.
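
Steps 1 to 3 above can be sketched in base R on a toy example. This is only an illustration under simplifying assumptions, not the package's implementation: the reference matrix, the labels and the test cell are invented, every gene is treated as a marker, and fine-tuning is skipped.

```r
# Toy reference: 6 genes x 6 samples, two labels with opposite expression trends.
ref <- cbind(
    A1 = c(6, 5, 4, 3, 2, 1), A2 = c(6, 4, 5, 3, 1, 2), A3 = c(5, 6, 4, 2, 3, 1),
    B1 = c(1, 2, 3, 4, 5, 6), B2 = c(2, 1, 3, 5, 4, 6), B3 = c(1, 3, 2, 4, 6, 5)
)
rownames(ref) <- paste0("gene", 1:6)
ref.labels <- c("A", "A", "A", "B", "B", "B")

# A single (hypothetical) test cell whose profile resembles label "A".
test.cell <- c(5.5, 5.1, 3.9, 3.2, 1.5, 1.2)

# Step 1: Spearman correlation between the test cell and each reference sample.
# All genes are used here; SingleR would restrict this to the marker genes.
cors <- cor(test.cell, ref, method = "spearman")[1, ]

# Step 2: per-label score as the 0.8 quantile of that label's correlations.
scores <- vapply(split(cors, ref.labels), quantile, numeric(1),
    probs = 0.8, names = FALSE)

# Step 3: the label with the highest score is taken as the annotation.
assigned <- names(scores)[which.max(scores)]
print(scores)
print(assigned)
```

Using a quantile rather than the maximum or the mean makes the score less sensitive to a single unusual reference sample while still rewarding labels where most samples correlate well with the test cell.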

Automatic annotation provides a convenient way of transferring biological knowledge across datasets. In this manner, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can be propagated to new datasets in an automated manner.

2 Using the built-in references

SingleR provides several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated data retrieval functions. For example, we obtain reference data from the Human Primary Cell Atlas using the HumanPrimaryCellAtlasData() function, which returns a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels.

library(SingleR)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(2): label.main label.fine

Our test dataset is taken from La Manno et al. (2016).
For the sake of speed, we will only label the first 100 cells from this dataset.

library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

# SingleR() expects log-counts, but the function will also happily take raw
# counts for the test dataset. The reference, however, must have log-values.
library(scater)
hESCs <- logNormCounts(hESCs)

We use our hpca.se reference to annotate each cell in hESCs via the SingleR() function, which uses the algorithm described above. Note that the default marker detection method is to take the genes with the largest positive log-fold changes in the per-label medians for each gene.

pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
pred.hesc
## DataFrame with 100 rows and 5 columns
##                                                                     scores
##                                                                   <matrix>
## 1772122_301_C02  0.118426779945786:0.179699807625087:0.157326274226517:...
## 1772122_180_E05  0.129708246318855:0.236277439793527:0.202370888668263:...
## 1772122_300_H02  0.158201338525345:0.250060222727419:0.211831550178353:...
## 1772122_180_B09   0.158778546217777:0.27716592787528:0.222681369744636:...
## 1772122_180_G04   0.138505219642345:0.236658649096383:0.19092437361406:...
## ...                                                                    ...
## 1772122_299_E07  0.145931041885859:0.241153701803065:0.217382763112476:...
## 1772122_180_D02  0.122983434596168:0.239181076829949:0.181221997276501:...
## 1772122_300_D09  0.129757310468164:0.233775092572195:0.196637664917917:...
## 1772122_298_F09  0.143118885460347:0.262267367714562:0.214329641867196:...
## 1772122_302_A11 0.0912854247387272:0.185945405472165:0.139232371863794:...
##                         first.labels                         tuning.scores
##                          <character>                           <DataFrame>
## 1772122_301_C02 Neuroepithelial_cell   0.18244020296249:0.0991115652997192
## 1772122_180_E05 Neuroepithelial_cell  0.137548373236792:0.0647133734667384
## 1772122_300_H02 Neuroepithelial_cell   0.275798157639906:0.136969040146444
## 1772122_180_B09 Neuroepithelial_cell 0.0851622797320583:0.0819878452425098
## 1772122_180_G04 Neuroepithelial_cell   0.198841544187094:0.101662168246495
## ...                              ...                                   ...
## 1772122_299_E07 Neuroepithelial_cell  0.176002520599547:0.0922503823656398
## 1772122_180_D02 Neuroepithelial_cell   0.196760862365318:0.112480486219438
## 1772122_300_D09 Neuroepithelial_cell 0.0816424287822026:0.0221368018363302
## 1772122_298_F09 Neuroepithelial_cell  0.187249853552379:0.0671892835266423
## 1772122_302_A11 Neuroepithelial_cell   0.156079956344163:0.105132159755961
##                               labels        pruned.labels
##                          <character>          <character>
## 1772122_301_C02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05              Neurons              Neurons
## 1772122_300_H02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 Neuroepithelial_cell Neuroepithelial_cell
## ...                              ...                  ...
## 1772122_299_E07 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_D02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_300_D09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_298_F09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_302_A11            Astrocyte            Astrocyte

Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before fine-tuning (first.labels), after fine-tuning (labels) and after pruning (pruned.labels), along with the associated scores. We summarize the distribution of labels across our subset of cells:

table(pred.hesc$labels)
## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                   14                   81                    5
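
The default marker detection mentioned earlier (largest positive log-fold changes in the per-label medians) can be illustrated with a small base R sketch. The matrix, the labels and the top_markers() helper are hypothetical and exist only for this example; the package's own procedure differs in details such as how many genes are kept per comparison.

```r
# Toy log-expression matrix: 5 genes x 6 reference samples, with three labels.
toy.ref <- matrix(
    c(8, 7, 1, 2, 1, 1,
      1, 2, 9, 8, 2, 1,
      1, 1, 2, 1, 7, 9,
      5, 5, 5, 5, 5, 5,
      6, 7, 6, 5, 1, 2),
    nrow = 5, byrow = TRUE,
    dimnames = list(paste0("gene", 1:5), NULL)
)
toy.labels <- c("X", "X", "Y", "Y", "Z", "Z")

# Per-label median of each gene (genes in rows, labels in columns).
meds <- sapply(split(seq_along(toy.labels), toy.labels), function(idx) {
    apply(toy.ref[, idx, drop = FALSE], 1, median)
})

# Hypothetical helper: for an ordered pair of labels, keep the genes with the
# largest positive difference in medians (a log-fold change on the log scale).
top_markers <- function(first, second, n = 2) {
    lfc <- meds[, first] - meds[, second]
    lfc <- sort(lfc[lfc > 0], decreasing = TRUE)
    names(head(lfc, n))
}

markers.XY <- top_markers("X", "Y")
markers.XZ <- top_markers("X", "Z")

# Union of the markers over the pairwise comparisons considered here.
marker.union <- union(markers.XY, markers.XZ)
print(marker.union)
```

In this toy case gene4 is never selected because its median is identical in every label, which is the intended behaviour for an uninformative gene.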

At this point, it is worth noting that SingleR is workflow/package agnostic. The above example uses SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.
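
As a rough sketch of the matrix-based usage, the code below passes ordinary matrices to SingleR(). All data are simulated and the object names (ref.mat, test.mat, pred.toy) are made up for this example; it assumes that both matrices hold log-scale values and share row names.

```r
library(SingleR)

set.seed(100)
genes <- paste0("gene", seq_len(200))

# Simulated reference: 200 genes x 30 samples, three labels with 10 samples each.
# A different block of 20 genes is upregulated in each label.
ref.labels <- rep(c("A", "B", "C"), each = 10)
ref.mat <- matrix(rnorm(200 * 30), nrow = 200, dimnames = list(genes, NULL))
ref.mat[1:20, ref.labels == "A"] <- ref.mat[1:20, ref.labels == "A"] + 5
ref.mat[21:40, ref.labels == "B"] <- ref.mat[21:40, ref.labels == "B"] + 5
ref.mat[41:60, ref.labels == "C"] <- ref.mat[41:60, ref.labels == "C"] + 5

# Simulated test data: 5 cells that share the upregulated genes of label "A".
test.mat <- matrix(rnorm(200 * 5), nrow = 200,
    dimnames = list(genes, paste0("cell", 1:5)))
test.mat[1:20, ] <- test.mat[1:20, ] + 5

# Plain matrices are supplied directly; no SummarizedExperiment is required.
pred.toy <- SingleR(test = test.mat, ref = ref.mat, labels = ref.labels)
table(pred.toy$labels)
```

The same call should work for a matrix extracted from another framework, provided the gene identifiers in the row names match those of the reference.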

3 Using single-cell references

Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.

library(scRNAseq)
sceM <- MuraroPancreasData()

# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]

sceM <- logNormCounts(sceM)

We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.

sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG) 
sceG <- sceG[,1:100]

We then run SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon rank sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero).

pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)
## 
##      acinar        beta       delta        duct endothelial 
##          53           4           1          41           1
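
The remark above about medians being zero can be made concrete with a small base R example. The expression values are invented for illustration, and the comparison uses wilcox.test() from the stats package rather than SingleR's internal marker detection.

```r
# Hypothetical log-expression of one gene in two labels of a sparse dataset.
label1 <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 3)
label2 <- rep(0, 10)

# A median-based comparison sees no difference: both medians are zero.
median.diff <- median(label1) - median(label2)

# The Wilcoxon rank sum statistic still captures the shift. Dividing it by the
# number of pairs gives the proportion of (label1, label2) pairs in which the
# label1 cell has higher expression, with ties counted as one half.
wt <- wilcox.test(label1, label2, exact = FALSE)
auc <- unname(wt$statistic) / (length(label1) * length(label2))

print(median.diff)
print(auc)
```

Here the difference in medians is 0 while the rescaled statistic is 0.7, so this gene would be ignored by a median-based ranking but could still be picked up as a marker by a rank-based test.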

4 Annotation diagnostics

4.1 Based on the scores within cells

SingleR provides a few basic yet powerful visualization tools. plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. The actual assigned label for each cell is shown in the color bar at the top; note that this may not be the visually top-scoring label if fine-tuning is applied, as only the pre-tuned scores are directly comparable across all labels.

plotScoreHeatmap(pred.grun)

For this plot, the key point is to examine the spread of scores within each cell. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved.
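
The same idea can be checked numerically. The sketch below assumes a score matrix laid out like the scores field of the output (cells in rows, labels in columns); the values and the 0.05 cutoff are arbitrary choices for illustration and are not defaults of the package.

```r
# Hypothetical score matrix: cells in rows, reference labels in columns.
toy.scores <- rbind(
    cell1 = c(acinar = 0.60, beta = 0.20, duct = 0.25),
    cell2 = c(acinar = 0.41, beta = 0.15, duct = 0.40)
)

# Gap between the best and second-best score within each cell.
top.gap <- apply(toy.scores, 1, function(s) {
    s <- sort(s, decreasing = TRUE)
    s[[1]] - s[[2]]
})

# Cells with a small gap have an ambiguous assignment (cutoff is illustrative).
ambiguous <- names(top.gap)[top.gap < 0.05]
print(top.gap)
print(ambiguous)
```

In this toy case cell1 is clearly acinar, whereas cell2 scores almost equally for acinar and duct and would deserve a closer look.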

We can also display other metadata information for each cell by setting clusters= or annotation_col=. This is occasionally useful for examining potential batch effects, differences in cell type composition between conditions, relationship to clusters from an unsupervised analysis, etc. In the code below, we display which donor each cell comes from:

plotScoreHeatmap(pred.grun, 
    annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))

4.2 Based on the deltas across cells

The pruneScores() function will remove potentially poor-quality or ambiguous assignments. In particular, ambiguous assignments are identified based on the per-cell “delta”, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference. The exact threshold used for pruning is identified using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts.

to.remove <- pruneScores(pred.grun)
summary(to.remove)
##    Mode   FALSE    TRUE 
## logical      96       4

By default, SingleR() will report pruned labels in the pruned.labels field, where low-quality assignments are replaced with NA. However, the default pruning thresholds may not be appropriate for every dataset; see ?pruneScores for a more detailed discussion. We provide the plotScoreDistribution() function to help determine whether the thresholds are appropriate by using information across cells with the same label. This displays the per-label distribution of the deltas across cells, from which pruneScores() defines an appropriate threshold as 3 median absolute deviations (MADs) below the median.

plotScoreDistribution(pred.grun, show = "delta.med", ncol = 3, show.nmads = 3)
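
The delta and the MAD-based threshold described above can be sketched in base R as follows. The scores are invented, only one label is shown, and the calculation is a simplified stand-in for pruneScores() that uses mad() from the stats package with its default scaling.

```r
# Hypothetical inputs for eight cells assigned to the same label: the score of
# the assigned label and the per-cell median score across all labels.
assigned.score <- c(0.62, 0.60, 0.58, 0.61, 0.25, 0.55, 0.57, 0.56)
median.score   <- c(0.20, 0.21, 0.19, 0.20, 0.21, 0.22, 0.20, 0.21)
toy.label <- rep("acinar", 8)

# Delta: assigned score minus the median score for each cell.
delta <- assigned.score - median.score

# Per-label threshold: 3 MADs below the median of the deltas for that label.
threshold <- tapply(delta, toy.label, function(d) median(d) - 3 * mad(d))

# Cells whose delta falls below the threshold of their label would be pruned.
to.prune <- unname(delta < threshold[toy.label])
print(which(to.prune))
```

Only the fifth cell, whose delta is far lower than that of the other cells with the same label, falls below the threshold here. Because the threshold is derived from the distribution within each label, it adapts to labels whose correlations are systematically higher or lower.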

If some tuning parameters must be adjusted, we can simply call pruneScores() directly with adjusted parameters. Here, we set labels to NA if they are to be discarded, which is also how SingleR() marks such labels in pruned.labels.

new.pruned <- pred.grun$labels
new.pruned[pruneScores(pred.grun, nmads=5)] <- NA
table(new.pruned, useNA="always")
## new.pruned
##      acinar        beta       delta        duct endothelial        <NA> 
##          53           4           1          41           1           0

4.3 Based on marker gene expression

Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. We extract the identity of the markers from the metadata of the SingleR() results and use them in the plotHeatmap() function from scater, as shown below for beta cell markers. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. At the very least, it should exhibit upregulation of those markers relative to cells assigned to other labels.

all.markers <- metadata(pred.grun)$de.genes
sceG$labels <- pred.grun$labels

# Beta cell-related markers
plotHeatmap(sceG, order_columns_by="labels",
    features=unique(unlist(all.markers$beta))) 

We can similarly perform this for all labels by wrapping this code in a loop, as shown below:

for (lab in unique(pred.grun$labels)) {
    plotHeatmap(sceG, order_columns_by=list(I(pred.grun$labels)), 
        features=unique(unlist(all.markers[[lab]]))) 
}

Heatmaps are particularly useful because they allow users to check that the genes are actually biologically meaningful to that cell type’s identity. For example, beta cells would be expected to express insulin, and the fact that they do so gives more confidence to the correctness of the assignment. By comparison, the scores and deltas are more abstract and difficult to interpret for diagnostic purposes. If the identified markers are not meaningful or not consistently upregulated, some skepticism towards the quality of the assignments is warranted.
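
A numerical counterpart to the heatmap is to compare the average marker expression between cells assigned to a label and all other cells. The sketch below uses a small invented matrix and invented predictions; the gene names are only placeholders for beta cell markers and the values do not come from the dataset above.

```r
# Hypothetical log-expression matrix (marker genes in rows, test cells in
# columns) together with hypothetical predicted labels for those cells.
toy.expr <- rbind(
    INS  = c(9.1, 8.7, 0.2, 0.0, 0.5),
    IAPP = c(6.3, 7.0, 0.0, 0.4, 0.1)
)
colnames(toy.expr) <- paste0("cell", 1:5)
toy.pred <- c("beta", "beta", "acinar", "duct", "acinar")

# Mean expression of each marker in cells assigned to "beta" minus the mean in
# all other cells; positive values indicate upregulation in the assigned cells.
in.beta <- toy.pred == "beta"
upregulation <- rowMeans(toy.expr[, in.beta, drop = FALSE]) -
    rowMeans(toy.expr[, !in.beta, drop = FALSE])
print(upregulation)
```

Values close to zero or negative for most markers of a label would be a reason to doubt the assignments to that label.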

5 Available references

The legacy SingleR package provides RDA files that contain normalized expression values and cell type labels based on bulk RNA-seq, microarray and single-cell RNA-seq data from:

  • Blueprint (Martens and Stunnenberg 2013) and Encode (The ENCODE Project Consortium 2012),
  • the Human Primary Cell Atlas (Mabbott et al. 2013),
  • the murine ImmGen (Heng et al. 2008), and
  • a collection of mouse data sets downloaded from GEO (Benayoun et al. 2019).

The bulk RNA-seq and microarray data sets of the first three reference data sets were obtained from pre-sorted cell populations, i.e., the cell labels of these samples were mostly derived based on the respective sorting/purification strategy, not via in silico prediction methods.

Three additional reference datasets from bulk RNA-seq and microarray data for immune cells have also been prepared. Each of these datasets was also obtained from pre-sorted cell populations.

The characteristics of each dataset are summarized below:

Data retrieval                      Organism  Samples  Sample types                            No. of main labels  No. of fine labels  Cell type focus
HumanPrimaryCellAtlasData()         human     713      microarrays of sorted cell populations  37                  157                 Non-specific
BlueprintEncodeData()               human     259      RNA-seq                                 24                  43                  Non-specific
DatabaseImmuneCellExpressionData()  human     1561     RNA-seq                                 5                   15                  Immune
NovershternHematopoieticData()      human     211      microarrays of sorted cell populations  17                  38                  Hematopoietic & Immune
MonacoImmuneData()                  human     114      RNA-seq                                 11                  29                  Immune
ImmGenData()                        mouse     830      microarrays of sorted cell populations  20                  253                 Hematopoietic & Immune
MouseRNAseqData()                   mouse     358      RNA-seq                                 18                  28                  Non-specific

Details for each dataset can be viewed on the corresponding help page (e.g. ?ImmGenData). The available sample types in each set are listed in the sections below.

BlueprintEncodeData Labels

label.main label.fine
Neutrophils Neutrophils
Monocytes Monocytes
HSC MEP
CD4+ T-cells CD4+ T-cells
CD4+ T-cells Tregs
CD4+ T-cells CD4+ Tcm
CD4+ T-cells CD4+ Tem
CD8+ T-cells CD8+ Tcm
CD8+ T-cells CD8+ Tem
NK cells NK cells
B-cells naive B-cells
B-cells Memory B-cells
B-cells Class-switched memory B-cells
HSC HSC
HSC MPP
HSC CLP
HSC GMP
Macrophages Macrophages
CD8+ T-cells CD8+ T-cells
Erythrocytes Erythrocytes
HSC Megakaryocytes
HSC CMP
Macrophages Macrophages M1
Macrophages Macrophages M2
Endothelial cells Endothelial cells
DC DC
Eosinophils Eosinophils
B-cells Plasma cells
Chondrocytes Chondrocytes
Fibroblasts Fibroblasts
Smooth muscle Smooth muscle
Epithelial cells Epithelial cells
Melanocytes Melanocytes
Skeletal muscle Skeletal muscle
Keratinocytes Keratinocytes
Endothelial cells mv Endothelial cells
Myocytes Myocytes
Adipocytes Adipocytes
Neurons Neurons
Pericytes Pericytes
Adipocytes Preadipocytes
Adipocytes Astrocytes
Mesangial cells Mesangial cells

HumanPrimaryCellAtlasData Labels

label.main label.fine
DC DC:monocyte-derived:immature
DC DC:monocyte-derived:Galectin-1
DC DC:monocyte-derived:LPS
DC DC:monocyte-derived
Smooth_muscle_cells Smooth_muscle_cells:bronchial:vit_D
Smooth_muscle_cells Smooth_muscle_cells:bronchial
Epithelial_cells Epithelial_cells:bronchial
B_cell B_cell
Neutrophils Neutrophil
T_cells T_cell:CD8+_Central_memory
T_cells T_cell:CD8+
T_cells T_cell:CD4+
T_cells T_cell:CD8+_effector_memory_RA
T_cells T_cell:CD8+_effector_memory
T_cells T_cell:CD8+_naive
Monocyte Monocyte
Erythroblast Erythroblast
BM & Prog. BM
DC DC:monocyte-derived:rosiglitazone
DC DC:monocyte-derived:AM580
DC DC:monocyte-derived:rosiglitazone/AGN193109
DC DC:monocyte-derived:anti-DC-SIGN_2h
Endothelial_cells Endothelial_cells:HUVEC
Endothelial_cells Endothelial_cells:HUVEC:Borrelia_burgdorferi
Endothelial_cells Endothelial_cells:HUVEC:IFNg
Endothelial_cells Endothelial_cells:lymphatic
Endothelial_cells Endothelial_cells:HUVEC:Serum_Amyloid_A
Endothelial_cells Endothelial_cells:lymphatic:TNFa_48h
T_cells T_cell:effector
T_cells T_cell:CCR10+CLA+1,25(OH)2_vit_D3/IL-12
T_cells T_cell:CCR10-CLA+1,25(OH)2_vit_D3/IL-12
Gametocytes Gametocytes:spermatocyte
DC DC:monocyte-derived:A._fumigatus_germ_tubes_6h
Neurons Neurons:ES_cell-derived_neural_precursor
Keratinocytes Keratinocytes
Keratinocytes Keratinocytes:IL19
Keratinocytes Keratinocytes:IL20
Keratinocytes Keratinocytes:IL22
Keratinocytes Keratinocytes:IL24
Keratinocytes Keratinocytes:IL26
Keratinocytes Keratinocytes:KGF
Keratinocytes Keratinocytes:IFNg
Keratinocytes Keratinocytes:IL1b
HSC_-G-CSF HSC_-G-CSF
DC DC:monocyte-derived:mature
Monocyte Monocyte:anti-FcgRIIB
Macrophage Macrophage:monocyte-derived:IL-4/cntrl
Macrophage Macrophage:monocyte-derived:IL-4/Dex/cntrl
Macrophage Macrophage:monocyte-derived:IL-4/Dex/TGFb
Macrophage Macrophage:monocyte-derived:IL-4/TGFb
Monocyte Monocyte:leukotriene_D4
NK_cell NK_cell
NK_cell NK_cell:IL2
Embryonic_stem_cells Embryonic_stem_cells
Tissue_stem_cells Tissue_stem_cells:iliac_MSC
Chondrocytes Chondrocytes:MSC-derived
Osteoblasts Osteoblasts
Tissue_stem_cells Tissue_stem_cells:BM_MSC
Osteoblasts Osteoblasts:BMP2
Tissue_stem_cells Tissue_stem_cells:BM_MSC:BMP2
Tissue_stem_cells Tissue_stem_cells:BM_MSC:TGFb3
DC DC:monocyte-derived:Poly(IC)
DC DC:monocyte-derived:CD40L
DC DC:monocyte-derived:Schuler_treatment
DC DC:monocyte-derived:antiCD40/VAF347
Tissue_stem_cells Tissue_stem_cells:dental_pulp
T_cells T_cell:CD4+_central_memory
T_cells T_cell:CD4+_effector_memory
T_cells T_cell:CD4+_Naive
Smooth_muscle_cells Smooth_muscle_cells:vascular
Smooth_muscle_cells Smooth_muscle_cells:vascular:IL-17
BM BM
Platelets Platelets
Epithelial_cells Epithelial_cells:bladder
Macrophage Macrophage:monocyte-derived
Macrophage Macrophage:monocyte-derived:M-CSF
Macrophage Macrophage:monocyte-derived:M-CSF/IFNg
Macrophage Macrophage:monocyte-derived:M-CSF/Pam3Cys
Macrophage Macrophage:monocyte-derived:M-CSF/IFNg/Pam3Cys
Macrophage Macrophage:monocyte-derived:IFNa
Gametocytes Gametocytes:oocyte
Monocyte Monocyte:F._tularensis_novicida
Endothelial_cells Endothelial_cells:HUVEC:B._anthracis_LT
B_cell B_cell:Germinal_center
B_cell B_cell:Plasma_cell
B_cell B_cell:Naive
B_cell B_cell:Memory
DC DC:monocyte-derived:AEC-conditioned
Tissue_stem_cells Tissue_stem_cells:lipoma-derived_MSC
Tissue_stem_cells Tissue_stem_cells:adipose-derived_MSC_AM3
Endothelial_cells Endothelial_cells:HUVEC:FPV-infected
Endothelial_cells Endothelial_cells:HUVEC:PR8-infected
Endothelial_cells Endothelial_cells:HUVEC:H5N1-infected
Macrophage Macrophage:monocyte-derived:S._aureus
Fibroblasts Fibroblasts:foreskin
iPS_cells iPS_cells:skin_fibroblast-derived
iPS_cells iPS_cells:skin_fibroblast
T_cells T_cell:gamma-delta
Monocyte Monocyte:CD14+
Macrophage Macrophage:Alveolar
Macrophage Macrophage:Alveolar:B._anthacis_spores
Neutrophils Neutrophil:inflam
iPS_cells iPS_cells:PDB_fibroblasts
iPS_cells iPS_cells:PDB_1lox-17Puro-5
iPS_cells iPS_cells:PDB_1lox-17Puro-10
iPS_cells iPS_cells:PDB_1lox-21Puro-20
iPS_cells iPS_cells:PDB_1lox-21Puro-26
iPS_cells iPS_cells:PDB_2lox-5
iPS_cells iPS_cells:PDB_2lox-22
iPS_cells iPS_cells:PDB_2lox-21
iPS_cells iPS_cells:PDB_2lox-17
iPS_cells iPS_cells:CRL2097_foreskin
iPS_cells iPS_cells:CRL2097_foreskin-derived:d20_hepatic_diff
iPS_cells iPS_cells:CRL2097_foreskin-derived:undiff.
B_cell B_cell:CXCR4+_centroblast
B_cell B_cell:CXCR4-_centrocyte
Endothelial_cells Endothelial_cells:HUVEC:VEGF
iPS_cells iPS_cells:fibroblasts
iPS_cells iPS_cells:fibroblast-derived:Direct_del._reprog
iPS_cells iPS_cells:fibroblast-derived:Retroviral_transf
Endothelial_cells Endothelial_cells:lymphatic:KSHV
Endothelial_cells Endothelial_cells:blood_vessel
Monocyte Monocyte:CD16-
Monocyte Monocyte:CD16+
Tissue_stem_cells Tissue_stem_cells:BM_MSC:osteogenic
Hepatocytes Hepatocytes
Neutrophils Neutrophil:uropathogenic_E._coli_UTI89
Neutrophils Neutrophil:commensal_E._coli_MG1655
MSC MSC
Neuroepithelial_cell Neuroepithelial_cell:ESC-derived
Astrocyte Astrocyte:Embryonic_stem_cell-derived
Endothelial_cells Endothelial_cells:HUVEC:IL-1b
HSC_CD34+ HSC_CD34+
CMP CMP
GMP GMP
B_cell B_cell:immature
MEP MEP
Myelocyte Myelocyte
Pre-B_cell_CD34- Pre-B_cell_CD34-
Pro-B_cell_CD34+ Pro-B_cell_CD34+
Pro-Myelocyte Pro-Myelocyte
Smooth_muscle_cells Smooth_muscle_cells:umbilical_vein
iPS_cells iPS_cells:foreskin_fibrobasts
iPS_cells iPS_cells:iPS:minicircle-derived
iPS_cells iPS_cells:adipose_stem_cells
iPS_cells iPS_cells:adipose_stem_cell-derived:lentiviral
iPS_cells iPS_cells:adipose_stem_cell-derived:minicircle-derived
Fibroblasts Fibroblasts:breast
Monocyte Monocyte:MCSF
Monocyte Monocyte:CXCL4
Neurons Neurons:adrenal_medulla_cell_line
Tissue_stem_cells Tissue_stem_cells:CD326-CD56+
NK_cell NK_cell:CD56hiCD62L+
T_cells T_cell:Treg:Naive
Neutrophils Neutrophil:LPS
Neutrophils Neutrophil:GM-CSF_IFNg
Monocyte Monocyte:S._typhimurium_flagellin
Neurons Neurons:Schwann_cell

DatabaseImmuneCellExpressionData Labels

label.main     label.fine
B cells        B cells, naive
Monocytes      Monocytes, CD14+
Monocytes      Monocytes, CD16+
NK cells       NK cells
T cells, CD4+  T cells, CD4+, memory TREG
T cells, CD4+  T cells, CD4+, naive
T cells, CD4+  T cells, CD4+, naive, stimulated
T cells, CD4+  T cells, CD4+, naive TREG
T cells, CD4+  T cells, CD4+, TFH
T cells, CD4+  T cells, CD4+, Th1
T cells, CD4+  T cells, CD4+, Th1_17
T cells, CD4+  T cells, CD4+, Th17
T cells, CD4+  T cells, CD4+, Th2
T cells, CD8+  T cells, CD8+, naive
T cells, CD8+  T cells, CD8+, naive, stimulated

NovershternHematopoieticData Labels

label.main       label.fine
Basophils        Basophils
B cells          Naïve B cells
B cells          Mature B cells class able to switch
B cells          Mature B cells
B cells          Mature B cells class switched
CMPs             Common myeloid progenitors
Dendritic cells  Plasmacytoid Dendritic Cells
Dendritic cells  Myeloid Dendritic Cells
Eosinophils      Eosinophils
Erythroid cells  Erythroid_CD34+ CD71+ GlyA-
Erythroid cells  Erythroid_CD34- CD71+ GlyA-
Erythroid cells  Erythroid_CD34- CD71+ GlyA+
Erythroid cells  Erythroid_CD34- CD71lo GlyA+
Erythroid cells  Erythroid_CD34- CD71- GlyA+
GMPs             Granulocyte/monocyte progenitors
Granulocytes     Colony Forming Unit-Granulocytes
Granulocytes     Granulocytes (Neutrophilic Metamyelocytes)
Granulocytes     Granulocytes (Neutrophils)
HSCs             Hematopoietic stem cells_CD133+ CD34dim
HSCs             Hematopoietic stem cells_CD38- CD34+
Megakaryocytes   Colony Forming Unit-Megakaryocytic
Megakaryocytes   Megakaryocytes
MEPs             Megakaryocyte/erythroid progenitors
Monocytes        Colony Forming Unit-Monocytes
Monocytes        Monocytes
NK cells         Mature NK cells_CD56- CD16+ CD3-
NK cells         Mature NK cells_CD56+ CD16+ CD3-
NK cells         Mature NK cells_CD56- CD16- CD3-
NK T cells       NK T cells
B cells          Early B cells
B cells          Pro B cells
CD8+ T cells     CD8+ Effector Memory RA
CD8+ T cells     Naive CD8+ T cells
CD8+ T cells     CD8+ Effector Memory
CD8+ T cells     CD8+ Central Memory
CD4+ T cells     Naive CD4+ T cells
CD4+ T cells     CD4+ Effector Memory
CD4+ T cells     CD4+ Central Memory

MonacoImmuneData Labels

label.main       label.fine
CD8+ T cells     Naive CD8 T cells
CD8+ T cells     Central memory CD8 T cells
CD8+ T cells     Effector memory CD8 T cells
CD8+ T cells     Terminal effector CD8 T cells
T cells          MAIT cells
T cells          Vd2 gd T cells
T cells          Non-Vd2 gd T cells
CD4+ T cells     Follicular helper T cells
CD4+ T cells     T regulatory cells
CD4+ T cells     Th1 cells
CD4+ T cells     Th1/Th17 cells
CD4+ T cells     Th17 cells
CD4+ T cells     Th2 cells
CD4+ T cells     Naive CD4 T cells
Progenitors      Progenitor cells
B cells          Naive B cells
B cells          Non-switched memory B cells
B cells          Exhausted B cells
B cells          Switched memory B cells
B cells          Plasmablasts
Monocytes        Classical monocytes
Monocytes        Intermediate monocytes
Monocytes        Non classical monocytes
NK cells         Natural killer cells
Dendritic cells  Plasmacytoid dendritic cells
Dendritic cells  Myeloid dendritic cells
Neutrophils      Low-density neutrophils
Basophils        Low-density basophils
CD4+ T cells     Terminal effector CD4 T cells

ImmGenData Labels

label.main label.fine
Macrophages Macrophages (MF.11C-11B+)
Macrophages Macrophages (MF.ALV)
Monocytes Monocytes (MO.6+I-)
Monocytes Monocytes (MO.6+2+)
B cells B cells (B.MEM)
B cells B cells (B1A)
DC DC (DC.11B+)
DC DC (DC.11B-)
Stromal cells Stromal cells (DN.CFA)
Stromal cells Stromal cells (DN)
Eosinophils Eosinophils (EO)
Fibroblasts Fibroblasts (FRC.CAD11.WT)
Fibroblasts Fibroblasts (FRC.CFA)
Fibroblasts Fibroblasts (FRC)
Neutrophils Neutrophils (GN)
Endothelial cells Endothelial cells (LEC.CFA)
Endothelial cells Endothelial cells (LEC)
Macrophages Macrophages (MF)
T cells T cells (T.DP.69-)
T cells T cells (T.DP)
T cells T cells (T.DP69+)
Macrophages Macrophages (MF.F480HI.GATA6KO)
Macrophages Macrophages (MF.F480HI.CTRL)
T cells T cells (T.CD4.1H)
T cells T cells (T.CD4.24H)
T cells T cells (T.CD4.48H)
T cells T cells (T.CD4.5H)
T cells T cells (T.CD4.96H)
T cells T cells (T.CD4.CTR)
T cells T cells (T.CD8.1H)
T cells T cells (T.CD8.24H)
T cells T cells (T.CD8.48H)
T cells T cells (T.CD8.5H)
T cells T cells (T.CD8.96H)
T cells T cells (T.CD8.CTR)
Macrophages Macrophages (MFAR-)
Monocytes Monocytes (MO)
ILC ILC (ILC1.CD127+)
ILC ILC (LIV.ILC1.DX5-)
ILC ILC (LPL.NCR+ILC1)
ILC ILC (ILC2)
ILC ILC (LPL.NCR+ILC3)
ILC ILC (ILC3.LTI.CD4+)
ILC ILC (ILC3.LTI.CD4-)
ILC ILC (ILC3.LTI.4+)
NK cells NK cells (NK.CD127-)
ILC ILC (LIV.NK.DX5+)
ILC ILC (LPL.NCR+CNK)
Basophils Basophils (BA)
Epithelial cells Epithelial cells (Ep.5wk.MEC.Sca1+)
Epithelial cells Epithelial cells (Ep.5wk.MEChi)
Epithelial cells Epithelial cells (Ep.5wk.MEClo)
Epithelial cells Epithelial cells (Ep.8wk.CEC.Sca1+)
Epithelial cells Epithelial cells (Ep.8wk.CEChi)
Epithelial cells Epithelial cells (Ep.8wk.MEChi)
Epithelial cells Epithelial cells (Ep.8wk.MEClo)
Mast cells Mast cells (MC.ES)
Mast cells Mast cells (MC)
Mast cells Mast cells (MC.TO)
Mast cells Mast cells (MC.TR)
Mast cells Mast cells (MC.DIGEST)
Epithelial cells Epithelial cells (MECHI.GFP+.ADULT)
Epithelial cells Epithelial cells (MECHI.GFP+.ADULT.KO)
Epithelial cells Epithelial cells (MECHI.GFP-.ADULT)
Macrophages Macrophages (MF.480HI.NAIVE)
Macrophages Macrophages (MF.480INT.NAIVE)
T cells T cells (T.4EFF49D+11A+.D8.LCMV)
T cells T cells (T.4MEM49D+11A+.D30.LCMV)
T cells T cells (T.4NVE44-49D-11A-)
T cells T cells (T.8EFF.TBET+.OT1LISOVA)
T cells T cells (T.8EFF.TBET-.OT1LISOVA)
T cells T cells (T.8EFFKLRG1+CD127-.D8.LISOVA)
T cells T cells (T.8MEMKLRG1-CD127+.D8.LISOVA)
T cells T cells (T.4+8int)
T cells T cells (T.4FP3+25+)
T cells T cells (T.4int8+)
T cells T cells (T.4SP24-)
T cells T cells (T.4SP24int)
T cells T cells (T.4SP69+)
T cells T cells (T.8SP24-)
T cells T cells (T.8SP24int)
T cells T cells (T.8SP69+)
T cells T cells (T.DPbl)
T cells T cells (T.DPsm)
T cells T cells (T.ISP)
B cells B cells (B.FrE)
B cells B cells (B.FrF)
B cells B cells (preB.FrD)
B cells B cells (proB.FrBC)
B cells B cells (preB.FrC)
Stem cells Stem cells (SC.STSL)
T cells T cells (T.CD4+TESTNA)
T cells T cells (T.CD4+TESTDB)
B cells B cells (B.CD19CONTROL)
T cells T cells (T.CD4CONTROL)
T cells T cells (T.CD4TESTJS)
T cells T cells (T.CD4TESTCJ)
Stem cells Stem cells (SC.CD150-CD48-)
Tgd Tgd (Tgd.imm.vg2+)
Tgd Tgd (Tgd.imm.vg2)
Tgd Tgd (Tgd.mat.vg3)
Tgd Tgd (Tgd.mat.vg3.)
Tgd Tgd (Tgd)
Tgd Tgd (Tgd.vg2+.act)
Tgd Tgd (Tgd.vg2-.act)
Tgd Tgd (Tgd.vg2-)
B cells B cells (B.Fo)
B cells B cells (B.FRE)
B cells B cells (B.GC)
B cells B cells (B.MZ)
B cells B cells (B.T1)
B cells B cells (B.T2)
B cells B cells (B.T3)
B cells B cells (B1a)
B cells B cells (B1b)
DC DC (DC)
DC DC (DC.103+11B-)
DC DC (DC.8-4-11B+)
DC DC (DC.LC)
NK cells NK cells (NK.49CI+)
NK cells NK cells (NK.49CI-)
NK cells NK cells (NK.B2M-)
NK cells NK cells (NK.DAP10-)
NK cells NK cells (NK.DAP12-)
NK cells NK cells (NK.H+.MCMV1)
NK cells NK cells (NK.H+.MCMV7)
NK cells NK cells (NK.H+MCMV1)
NK cells NK cells (NK.MCMV7)
NK cells NK cells (NK)
NKT NKT (NKT.4+)
NKT NKT (NKT.4-)
NKT NKT (NKT.44+NK1.1+)
NKT NKT (NKT.44+NK1.1-)
NKT NKT (NKT.44-NK1.1-)
B cells B cells (preB.FRD)
B cells B cells (proB.CLP)
Stem cells Stem cells (proB.CLP)
B cells B cells (proB.FrA)
B cells B cells (proB.FRA)
B cells, pro B cells, pro (proB.FrA)
T cells T cells (T.4MEM)
T cells T cells (T.4Mem)
T cells T cells (T.4MEM44H62L)
T cells T cells (T.4Nve)
T cells T cells (T.4NVE)
T cells T cells (T.8EFF.OT1.D15.VSVOVA)
T cells T cells (T.8EFF.OT1.D5.VSVOVA)
T cells T cells (T.8EFF.OT1.VSVOVA)
T cells T cells (T.8EFF.OT1.D8.VSVOVA)
T cells T cells (T.8MEM)
T cells T cells (T.8Mem)
T cells T cells (T.8MEM.OT1.D106.VSVOVA)
T cells T cells (T.8EFF.OT1.D45VSV)
T cells T cells (T.8Nve)
T cells T cells (T.8NVE)
B cells B cells (proB.FRBC)
T cells T cells (T.4)
T cells T cells (T.4.Pa)
T cells T cells (T.4.PLN)
T cells T cells (T.4FP3-)
Tgd Tgd (Tgd.VG2+)
Tgd Tgd (Tgd.vg2+.TCRbko)
Tgd Tgd (Tgd.vg2-.TCRbko)
Tgd Tgd (Tgd.vg5+.act)
Tgd Tgd (Tgd.VG5+.ACT)
Tgd Tgd (Tgd.VG5+)
Tgd Tgd (Tgd.vg5-.act)
Tgd Tgd (Tgd.VG5-)
NK cells NK cells (NK.49H+)
NK cells NK cells (NK.49H-)
DC DC (DC.8+)
DC DC (DC.8-)
DC DC (DC.8-4-11B-)
DC DC (DC.PDC.8+)
DC DC (DC.PDC.8-)
Macrophages Macrophages (MF.II-480HI)
Macrophages Macrophages (MF.RP)
Macrophages Macrophages (MFIO5.II+480INT)
Macrophages Macrophages (MFIO5.II+480LO)
Macrophages Macrophages (MFIO5.II-480HI)
Macrophages Macrophages (MFIO5.II-480INT)
Monocytes Monocytes (MO.6C+II+)
Monocytes Monocytes (MO.6C+II-)
Monocytes Monocytes (MO.6C-II+)
Monocytes Monocytes (MO.6C-II-)
Monocytes Monocytes (MO.6C-IIINT)
T cells T cells (T.8EFF.OT1.D10LIS)
T cells T cells (T.8EFF.OT1.D10.LISOVA)
T cells T cells (T.8EFF.OT1.D15LIS)
T cells T cells (T.8EFF.OT1.D15.LISOVA)
T cells T cells (T.8EFF.OT1LISO)
T cells T cells (T.8EFF.OT1.LISOVA)
T cells T cells (T.8EFF.OT1.D8LISO)
T cells T cells (T.8EFF.OT1.D8.LISOVA)
T cells T cells (T.8MEM.OT1.D100.LISOVA)
T cells T cells (T.8MEM.OT1.D45.LISOVA)
T cells T cells (T.8NVE.OT1)
B cells B cells (B.FO)
Endothelial cells Endothelial cells (BEC)
Epithelial cells Epithelial cells (EP.MECHI)
Fibroblasts Fibroblasts (FI.MTS15+)
Fibroblasts Fibroblasts (FI)
Stromal cells Stromal cells (ST.31-38-44-)
Stem cells Stem cells (SC.LT34F)
Stem cells Stem cells (SC.MDP)
Stem cells Stem cells (SC.MEP)
Stem cells Stem cells (SC.MPP34F)
Stem cells Stem cells (SC.ST34F)
Stem cells Stem cells (SC.CDP)
Stem cells Stem cells (SC.CMP.DR)
Stem cells Stem cells (GMP)
Stem cells Stem cells (MLP)
Stem cells Stem cells (LTHSC)
T cells T cells (T.DN2-3)
T cells T cells (T.DN2)
T cells T cells (T.DN2A)
T cells T cells (T.DN2B)
T cells T cells (T.DN3-4)
T cells T cells (T.DN3A)
T cells T cells (T.DN3B)
T cells T cells (T.DN1-2)
T cells T cells (T.DN4)
Macrophages Macrophages (MF.103-11B+.SALM3)
Macrophages Macrophages (MF.103-11B+)
DC DC (DC.103-11B+24+)
Macrophages Macrophages (MF.103-11B+24-)
DC DC (DC.103-11B+F4-80LO.KD)
Macrophages Macrophages (MF.11CLOSER.SALM3)
Macrophages Macrophages (MF.11CLOSER)
Macrophages Macrophages (MF.103CLOSER)
Macrophages Macrophages (MF.II+480LO)
Neutrophils Neutrophils (GN.ARTH)
Neutrophils Neutrophils (GN.Thio)
Neutrophils Neutrophils (GN.URAC)
Macrophages Macrophages (MF.169+11CHI)
Macrophages Macrophages (MF.MEDL)
Macrophages Macrophages (MF.SBCAPS)
Microglia Microglia (Microglia)
T cells T cells (T.ETP)
Tgd Tgd (Tgd.imm.VG1+)
Tgd Tgd (Tgd.imm.VG1+VD6+)
Tgd Tgd (Tgd.mat.VG1+)
Tgd Tgd (Tgd.mat.VG1+VD6+)
Tgd Tgd (Tgd.mat.VG2+)
Tgd Tgd (Tgd.VG3+24AHI)
Tgd Tgd (Tgd.VG5+24AHI)
T cells T cells (T.8EFF.OT1.12HR.LISOVA)
T cells T cells (T.8EFF.OT1.24HR.LISOVA)
T cells T cells (T.8EFF.OT1.48HR.LISOVA)
T cells T cells (T.Tregs)
Tgd Tgd (Tgd.VG2+24AHI)
Tgd Tgd (Tgd.VG4+24AHI)
Tgd Tgd (Tgd.VG4+24ALO)

MouseRNAseqData Labels

label.main label.fine
Adipocytes Adipocytes
Neurons aNSCs
Astrocytes Astrocytes
Astrocytes Astrocytes activated
Endothelial cells Endothelial cells
Erythrocytes Erythrocytes
Fibroblasts Fibroblasts
Fibroblasts Fibroblasts activated
Fibroblasts Fibroblasts senescent
Granulocytes Granulocytes
Macrophages Macrophages
Microglia Microglia
Microglia Microglia activated
Monocytes Monocytes
Neurons Neurons
Neurons Neurons activated
NK cells NK cells
Neurons NPCs
Oligodendrocytes Oligodendrocytes
Neurons qNSCs
T cells T cells
Dendritic cells Dendritic cells
Cardiomyocytes Cardiomyocytes
Hepatocytes Hepatocytes
B cells B cells
Epithelial cells Ependymal
Oligodendrocytes OPCs
Macrophages Macrophages activated

6 Reference options

6.1 Pseudo-bulk aggregation

Single-cell reference datasets provide a like-for-like comparison to our test datasets, yielding a more accurate classification of the cells in the latter (hopefully). However, there are frequently many more samples in single-cell references compared to bulk references, increasing the computational work involved in classification. We avoid this by aggregating cells into one “pseudo-bulk” sample per label (e.g., by averaging across log-expression values) and using those as the reference, which allows us to achieve the same efficiency as the use of bulk references.

The obvious cost of this approach is that we discard potentially useful information about the distribution of cells within each label. Cells that belong to a heterogeneous population may not be correctly assigned if they are far from the population center. We attempt to preserve some of this information by using \(k\)-means clustering within each label to create pseudo-bulk samples that are representative of a particular region of the expression space (i.e., vector quantization). We create \(\sqrt{N}\) clusters given a label with \(N\) cells, which provides a reasonable compromise between reducing computational work and preserving the label’s internal distribution.
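To make the vector quantization idea concrete, the following base R sketch clusters the cells of each label with kmeans() and keeps the cluster centroids as pseudo-bulk profiles. It is only an illustration on simulated data: aggregate_by_label() is a hypothetical helper, and rounding \(\sqrt{N}\) down to the nearest integer is an assumption rather than a statement about how the package chooses the number of clusters.

```r
# Hypothetical helper (not part of SingleR): collapse a genes-by-cells
# log-expression matrix into k-means centroids within each label.
aggregate_by_label <- function(logexpr, labels) {
    labels <- as.character(labels)
    out <- list()
    for (lab in unique(labels)) {
        current <- logexpr[, labels == lab, drop = FALSE]
        n <- ncol(current)
        k <- max(1L, floor(sqrt(n)))  # assumption: round sqrt(N) down
        if (k < n) {
            # kmeans() clusters rows, so cells are transposed into rows.
            km <- kmeans(t(current), centers = k)
            centroids <- t(km$centers)
        } else {
            centroids <- current  # too few cells to be worth clustering
        }
        colnames(centroids) <- paste0(lab, ".", seq_len(ncol(centroids)))
        out[[lab]] <- centroids
    }
    list(profiles = do.call(cbind, out),
         label = rep(names(out), vapply(out, ncol, 0L)))
}

# Simulated data: 20 genes, 16 "alpha" cells and 9 "beta" cells.
set.seed(1)
logexpr <- matrix(rnorm(20 * 25), nrow = 20,
                  dimnames = list(paste0("gene", 1:20), NULL))
labels <- rep(c("alpha", "beta"), c(16, 9))

aggr_sketch <- aggregate_by_label(logexpr, labels)
table(aggr_sketch$label)  # alpha: 4 profiles, beta: 3 profiles
```

Each column of aggr_sketch$profiles is the mean log-expression profile of one cluster, so a heterogeneous label contributes several representative profiles instead of a single average.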

This aggregation approach is implemented in the aggregateReference() function, which is shown in action below for the Muraro et al. (2016) dataset. The function returns a SummarizedExperiment object containing the pseudo-bulk expression profiles and the corresponding labels.

set.seed(100) # for the k-means step.
aggr <- aggregateReference(sceM, labels=sceM$label)
aggr
## class: SummarizedExperiment 
## dim: 19059 116 
## metadata(0):
## assays(1): logcounts
## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(0):
## colnames(116): alpha.1 alpha.2 ... mesenchymal.8 epsilon.1
## colData names(1): label

The resulting SummarizedExperiment can then be used as a reference in SingleR().

pred.aggr <- SingleR(sceG, aggr, labels=aggr$label)
table(pred.aggr$labels)
## 
## acinar   beta  delta   duct 
##     52      4      1     43

6.2 Using multiple references

In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types than any individual reference can cover, especially when differences in resolution are also considered. Use of multiple references is supported by simply passing multiple objects to the ref= and labels= arguments in SingleR(). We demonstrate below by including another reference (from Blueprint-Encode) in our annotation of the La Manno et al. (2016) dataset:

bp.se <- BlueprintEncodeData()

pred.combined <- SingleR(test = hESCs, 
    ref = list(BP=bp.se, HPCA=hpca.se), 
    labels = list(bp.se$label.main, hpca.se$label.main))

The output takes the same form as previously described, and we can easily gain access to the combined set of labels:

table(pred.combined$labels)
## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                    4                   63                   33

Our strategy is to perform annotation on each reference separately and then take the highest-scoring label across references. This provides a light-weight approach to combining information from multiple references while avoiding batch effects and the need for up-front harmonization. (Of course, the main practical difficulty of this approach is that the same cell type may have different labels across references, which will require some implicit harmonization during interpretation.) Further comments on the justification behind the choice of this method can be found at ?combineResults.
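The strategy of taking the highest-scoring label across references can be sketched in base R with toy score matrices. All scores below are made up, best_per_reference() is a hypothetical helper, and the sketch compares raw scores directly; it is not a reproduction of the logic documented in ?combineResults.

```r
# Toy per-reference score matrices (cells in rows, labels in columns).
# All values are made up for illustration.
scores_bp <- rbind(cell1 = c(Neurons = 0.62, Astrocytes = 0.40),
                   cell2 = c(Neurons = 0.35, Astrocytes = 0.55))
scores_hpca <- rbind(cell1 = c(Neurons = 0.58, Neuroepithelial_cell = 0.66),
                     cell2 = c(Neurons = 0.30, Neuroepithelial_cell = 0.41))

# Hypothetical helper: best label and its score within one reference.
best_per_reference <- function(scores) {
    idx <- max.col(scores, ties.method = "first")
    data.frame(label = colnames(scores)[idx],
               score = scores[cbind(seq_len(nrow(scores)), idx)],
               row.names = rownames(scores),
               stringsAsFactors = FALSE)
}

# Step 1: annotate against each reference separately.
per_ref <- list(BP = best_per_reference(scores_bp),
                HPCA = best_per_reference(scores_hpca))

# Step 2: for each cell, keep the label from the reference whose best
# score is highest.
best_scores <- vapply(per_ref, function(x) x$score,
                      numeric(nrow(scores_bp)))
winner <- max.col(best_scores, ties.method = "first")
combined <- data.frame(
    label = vapply(seq_along(winner),
                   function(i) per_ref[[winner[i]]]$label[i], ""),
    reference = names(per_ref)[winner],
    row.names = rownames(scores_bp),
    stringsAsFactors = FALSE)
combined
```

In this toy case, cell1 takes its label from the HPCA scores and cell2 from the Blueprint/ENCODE scores, which shows how the combined result can mix vocabularies from different references.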

6.3 Harmonizing labels

The matchReferences() function provides a simple yet elegant approach for label harmonization between two references. Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed. Probabilities close to 1 indicate there is a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.
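As a rough illustration of what a mutual assignment probability looks like, the base R sketch below starts from two made-up cross-annotation tables. It assumes that the mutual probability for a pair of labels is the product of the two conditional assignment frequencies; this definition is an assumption made for the sketch, and ?matchReferences should be consulted for the exact calculation.

```r
# Made-up counts. Rows: true labels of the annotated reference;
# columns: labels assigned from the other reference.
ref1_by_ref2 <- matrix(c(9, 1, 0,
                         0, 10, 0),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("A1", "B1"), c("A2", "B2", "C2")))
ref2_by_ref1 <- matrix(c(10, 0,
                         2, 8,
                         5, 5),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("A2", "B2", "C2"), c("A1", "B1")))

# Conditional assignment frequencies in each direction.
p_1_to_2 <- prop.table(ref1_by_ref2, margin = 1)
p_2_to_1 <- prop.table(ref2_by_ref1, margin = 1)

# Assumed definition: product of the two directions for each label pair.
mutual <- p_1_to_2 * t(p_2_to_1)
round(mutual, 2)
```

Here A1/A2 and B1/B2 behave like 1:1 matches (0.9 and 0.8), while the all-zero column for C2 marks a label with no counterpart in the first reference.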

matched <- matchReferences(bp.se, hpca.se,
    bp.se$label.main, hpca.se$label.main)
pheatmap::pheatmap(matched, color=viridis::plasma(100))

A heatmap like the one above can be used to guide harmonization to enforce a consistent vocabulary across all labels representing the same cell type or state. The most obvious benefit of harmonization is that interpretation of the results is simplified. However, an even more important effect is that the presence of harmonized labels from multiple references allows the classification machinery to protect against irrelevant batch effects between references. For example, in SingleR()’s case, marker genes are favored if they are consistently upregulated across multiple references, improving robustness to technical idiosyncrasies in any test dataset.

We stress that some manual intervention is still required in this process, given the risks posed by differences in biological systems and technologies. For example, neurons are considered unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4+ and CD8+ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here.

As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This is an “off-label” use that involves substituting the cluster assignments as proxies for the labels. We can then match up clusters and integrate conclusions from multiple datasets without the difficulty of batch correction and reclustering.

7 Advanced use

7.1 Improving efficiency

Advanced users can split the SingleR() workflow into two separate training and classification steps. This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once. The resulting data structures can then be re-used across multiple classifications with different test datasets, provided the test feature set is identical to or a superset of the features in the training set. For example:

common <- intersect(rownames(hESCs), rownames(hpca.se))
trained <- trainSingleR(hpca.se[common,], labels=hpca.se$label.main)
pred.hesc2 <- classifySingleR(hESCs[common,], trained)
table(pred.hesc$labels, pred.hesc2$labels)
##                       
##                        Astrocyte Neuroepithelial_cell Neurons
##   Astrocyte                   14                    0       0
##   Neuroepithelial_cell         0                   81       0
##   Neurons                      0                    0       5
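When a trained reference is re-used across several test datasets, it can help to verify the feature requirement up front. The base R sketch below uses a hypothetical helper, match_test_features(), that fails early if any trained feature is absent from a test matrix and otherwise returns the test data restricted to those features. It makes no assumption about whether classifySingleR() performs an equivalent check internally.

```r
# Hypothetical helper: confirm that the test set contains every feature
# used in training, then subset the test set to those features.
match_test_features <- function(test, trained_features) {
    absent <- setdiff(trained_features, rownames(test))
    if (length(absent) > 0L) {
        stop(length(absent), " trained feature(s) absent from the test set")
    }
    test[trained_features, , drop = FALSE]
}

# Toy test matrix with five genes and three cells.
test_mat <- matrix(seq_len(15), nrow = 5,
                   dimnames = list(paste0("g", 1:5), paste0("cell", 1:3)))

# The test features are a superset of the trained features: this succeeds.
ok <- match_test_features(test_mat, c("g2", "g4"))
rownames(ok)

# A trained feature is missing from the test set: this fails early.
failed <- tryCatch(match_test_features(test_mat, c("g2", "g9")),
                   error = function(e) conditionMessage(e))
failed
```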

Other efficiency improvements are possible through several arguments:

  • Switching to an approximate algorithm for the nearest neighbor search in trainSingleR() via the BNPARAM= argument from the BiocNeighbors package.
  • Parallelizing the fine-tuning step in classifySingleR() with the BPPARAM= argument from the BiocParallel package.

These arguments can also be specified in the SingleR() command.

7.2 Defining custom markers

Users can also construct their own marker lists with any DE testing machinery. For example, we can perform pairwise \(t\)-tests using methods from scran and obtain the top 10 marker genes from each pairwise comparison.

library(scran)
out <- pairwiseTTests(logcounts(sceM), sceM$label, direction="up")
markers <- getTopMarkers(out$statistics, out$pairs, n=10)

We then supply these genes to SingleR() directly via the genes= argument. A more focused gene set also allows annotation to be performed more quickly compared to the default approach.

pred.grun2 <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=markers)
table(pred.grun2$labels)
## 
##  acinar    beta   delta    duct      pp unclear 
##      59       4       1      34       1       1

In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels. This is accommodated by supplying a named list of character vectors to genes=. Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded.

label.markers <- lapply(markers, unlist, recursive=FALSE)
pred.grun3 <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=label.markers)
table(pred.grun$labels, pred.grun3$labels)
##              
##               acinar beta delta duct pp
##   acinar          51    0     0    2  0
##   beta             0    4     0    0  0
##   delta            0    0     1    0  0
##   duct             2    0     0   39  0
##   endothelial      0    0     0    0  1
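The difference between the two marker formats is easier to see on a toy example. In the base R sketch below, the gene names are made up and the flattening step is a variant of the lapply() call above that also removes duplicated genes and drops names; it is meant to show the shape of the two structures, not to replace that call.

```r
# Toy list-of-lists: pairwise_markers[[a]][[b]] holds genes that are
# upregulated in label a relative to label b. Gene names are made up.
pairwise_markers <- list(
    alpha = list(alpha = character(0),
                 beta = c("g1", "g2"),
                 duct = c("g1", "g3")),
    beta = list(alpha = c("g4"),
                beta = character(0),
                duct = c("g4", "g5")),
    duct = list(alpha = c("g6"),
                beta = c("g6", "g7"),
                duct = character(0))
)

# Collapse to one character vector per label. The information about which
# comparison each gene came from is discarded at this point.
label_markers_toy <- lapply(pairwise_markers, function(x) {
    unique(unlist(x, use.names = FALSE))
})
str(label_markers_toy)
```

After flattening, "g1" is simply a marker for alpha; the fact that it separated alpha from both beta and duct can no longer be recovered, which is why the per-label format is expected to be less powerful.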

8 FAQs

How can I use this with my Seurat, SingleCellExperiment, or cell_data_set object?

SingleR is workflow-agnostic: all it needs is normalized counts. An example showing how to map its results back to common single-cell data objects is available in the README.

Where can I find reference sets appropriate for my data?

The scRNAseq package contains many single-cell datasets, with more continually being added. The ArrayExpress and GEOquery packages can be used to download any of the bulk or single-cell datasets in ArrayExpress or GEO, respectively.

9 Session information

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.14.6                knitr_1.28                 
##  [3] scater_1.14.6               ggplot2_3.3.0              
##  [5] scRNAseq_2.0.2              SingleCellExperiment_1.8.0 
##  [7] SingleR_1.0.6               SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.2         BiocParallel_1.20.1        
## [11] matrixStats_0.56.0          Biobase_2.46.0             
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [15] IRanges_2.20.2              S4Vectors_0.24.3           
## [17] BiocGenerics_0.32.0         BiocStyle_2.14.4           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] RColorBrewer_1.1-2            httr_1.4.1                   
##  [5] tools_3.6.3                   R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.1.0                     colorspace_1.4-1             
## [11] withr_2.1.2                   gridExtra_2.3                
## [13] tidyselect_1.0.0              bit_1.1-15.2                 
## [15] curl_4.3                      compiler_3.6.3               
## [17] cli_2.0.2                     BiocNeighbors_1.4.2          
## [19] labeling_0.3                  bookdown_0.18                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.25                
## [25] rmarkdown_2.1                 XVector_0.26.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] highr_0.8                     limma_3.42.2                 
## [31] dbplyr_1.4.2                  fastmap_1.0.1                
## [33] rlang_0.4.5                   RSQLite_2.2.0                
## [35] shiny_1.4.0.2                 DelayedMatrixStats_1.8.0     
## [37] farver_2.0.3                  dplyr_0.8.5                  
## [39] RCurl_1.98-1.1                magrittr_1.5                 
## [41] BiocSingular_1.2.2            GenomeInfoDbData_1.2.2       
## [43] Matrix_1.2-18                 Rcpp_1.0.4                   
## [45] ggbeeswarm_0.6.0              munsell_0.5.0                
## [47] fansi_0.4.1                   viridis_0.5.1                
## [49] lifecycle_0.2.0               edgeR_3.28.1                 
## [51] stringi_1.4.6                 yaml_2.2.1                   
## [53] zlibbioc_1.32.0               BiocFileCache_1.10.2         
## [55] AnnotationHub_2.18.0          grid_3.6.3                   
## [57] blob_1.2.1                    dqrng_0.2.1                  
## [59] promises_1.1.0                ExperimentHub_1.12.0         
## [61] crayon_1.3.4                  lattice_0.20-41              
## [63] magick_2.3                    locfit_1.5-9.4               
## [65] pillar_1.4.3                  igraph_1.2.5                 
## [67] glue_1.4.0                    BiocVersion_3.10.1           
## [69] evaluate_0.14                 BiocManager_1.30.10          
## [71] vctrs_0.2.4                   httpuv_1.5.2                 
## [73] gtable_0.3.0                  purrr_0.3.3                  
## [75] assertthat_0.2.1              xfun_0.12                    
## [77] rsvd_1.0.3                    mime_0.9                     
## [79] xtable_1.8-4                  later_1.0.0                  
## [81] viridisLite_0.3.0             pheatmap_1.0.12              
## [83] tibble_3.0.0                  AnnotationDbi_1.48.0         
## [85] beeswarm_0.2.3                memoise_1.1.0                
## [87] statmod_1.4.34                ellipsis_0.3.0               
## [89] interactiveDisplayBase_1.24.0

References

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2):163–72.

Benayoun, Bérénice A., Elizabeth A. Pollina, Param Priya Singh, Salah Mahmoudi, Itamar Harel, Kerriann M. Casey, Ben W. Dulken, Anshul Kundaje, and Anne Brunet. 2019. “Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses.” Genome Research 29:697–709. https://doi.org/10.1101/gr.240093.118.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Heng, Tracy S.P., Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, et al. 2008. “The immunological genome project: Networks of gene expression in immune cells.” Nature Immunology 9 (10):1091–4. https://doi.org/10.1038/ni1008-1091.

La Manno, G., D. Gyllborg, S. Codeluppi, K. Nishimura, C. Salto, A. Zeisel, L. E. Borm, et al. 2016. “Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells.” Cell 167 (2):566–80.

Mabbott, Neil A., J. K. Baillie, Helen Brown, Tom C. Freeman, and David A. Hume. 2013. “An expression atlas of human primary cells: Inference of gene function from coexpression networks.” BMC Genomics 14. https://doi.org/10.1186/1471-2164-14-632.

Martens, Joost H A, and Hendrik G. Stunnenberg. 2013. “BLUEPRINT: Mapping human blood cell epigenomes.” Haematologica 98:1487–9. https://doi.org/10.3324/haematol.2013.094243.

Monaco, Gianni, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, et al. 2019. “RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute Deconvolution of Human Immune Cell Types.” Cell Reports 26 (6):1627–1640.e7. https://doi.org/10.1016/j.celrep.2019.01.041.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

Novershtern, Noa, Aravind Subramanian, Lee N. Lawton, Raymond H. Mak, W. Nicholas Haining, Marie E. McConkey, Naomi Habib, et al. 2011. “Densely Interconnected Transcriptional Circuits Control Cell States in Human Hematopoiesis.” Cell 144 (2):296–309. https://doi.org/10.1016/j.cell.2011.01.004.

Schmiedel, Benjamin J., Divya Singh, Ariel Madrigal, Alan G. Valdovino-Gonzalez, Brandie M. White, Jose Zapardiel-Gonzalo, Brendan Ha, et al. 2018. “Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression.” Cell 175 (6):1701–1715.e16. https://doi.org/10.1016/j.cell.2018.10.022.

The ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature. https://doi.org/10.1038/nature11247.