Single cell RNA sequencing (scRNAseq) has made it possible to examine the cellular heterogeneity within a tissue or sample, and observe changes and characteristics in specific cell types. To do this, we need to group the cells into clusters and figure out what they are.
In a typical scRNAseq experiment the gene expression levels are first quantified to per-cell counts. Then, cells are clustered into related groups (or clusters) on the basis of transcriptional similarity. There are many different cell-clustering tools that can do this (Freytag et al. 2017).
Clustering tools generally define groups of similar cells - but do not offer explanation as to their biological contents. The annotation of the ‘cell type’ of each cluster is performed by a domain expert biologist - who can examine the known marker genes, or differential expression to understand what type of cell each cluster might describe. This can be a time-consuming semi-manual process, and must be performed before addressing the actual biological question of interest.
The celaref package aims to streamline this cell-type identification step, by suggesting cluster labels on the basis of similarity to an already-characterised reference dataset - whether that’s from a similar experiment performed previously in the same lab, or from a public dataset from a similar sample.
Celaref differs from other cell-type identification tools like scmap (Kiselev, Yiu, and Hemberg 2018) or (functions in) MUDAN in that it operates at the cluster-level.
Celaref requires a table of read counts per cell per gene, and a list of the cells belonging to each of the clusters (for both test and reference data). To match cell types, it looks up the most distinctly enriched genes of each query group in the gene rankings of each reference group.
A typical celaref workflow is below, characterising a query dataset’s cell clusters on the basis of transcriptomic similarity to an annotated reference dataset.
To compare scRNAseq datasets with celaref, two inputs are needed for each dataset: a table of read counts per cell per gene, and a list of which cells belong to which cluster.
See Input for details.
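As a rough illustration of the two inputs, the base R sketch below builds a toy counts table and a matching cell-to-cluster table. All gene names, cell names and column names (`cell_sample`, `group`) are invented for illustration and are not taken from the package documentation.

```r
# Toy counts table: one row per gene, one column per cell (raw read counts).
# Gene and cell names are hypothetical.
counts <- matrix(
  c(5, 0, 2, 0,
    0, 7, 0, 9,
    1, 1, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"),
                  c("cell1", "cell2", "cell3", "cell4"))
)

# Cell information: the cluster each cell was assigned to.
# Column names are illustrative assumptions.
cell_info <- data.frame(
  cell_sample = c("cell1", "cell2", "cell3", "cell4"),
  group       = c("cluster_1", "cluster_2", "cluster_1", "cluster_2"),
  stringsAsFactors = FALSE
)

# Sanity check: every cell with counts has exactly one cluster assignment.
stopifnot(setequal(colnames(counts), cell_info$cell_sample),
          !any(duplicated(cell_info$cell_sample)))

# Number of cells per cluster.
cells_per_group <- table(cell_info$group)
```

Checking that the two tables agree on cell identifiers before any analysis is a cheap way to catch mismatched or truncated input files.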
Cell clusters might be defined by any cell-clustering technique, such as those implemented in Seurat (Satija et al. 2015), cellRanger (10X Genomics) or SC3 (Kiselev et al. 2017), among many others.
Every dataset, whether a query or a reference, is prepared the same way. For each cluster, cells within that cluster are compared to the rest of the cells pooled together, calculating differential gene expression using MAST (Finak et al. 2015). Because of the low counts and potential drop-out issues in single cell RNAseq data, only genes enriched in each cluster are considered. For every cluster, genes are ranked from most to least enriched according to the lower 95% CI of their fold-change. Each gene is assigned a ‘rescaled rank’ from 0 (most enriched) to 1 (most absent).
Note that this step is the most time consuming, but only needs to be done once per dataset.
A list of ‘Up’ genes is extracted for each query cluster, defined as those that have significantly higher expression in that cluster versus the rest of the sample (p<0.01 after BH multiple hypothesis correction). The ‘Up’ gene list is capped at the top 100 (ranked by lower 95% CI of fold-change). Then, those genes are looked up in the ranking of genes in each reference cell cluster. The distribution of these ‘up gene’ ranks is plotted to evaluate similarity of the query cell-group to a reference cell-group.
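The ranking idea can be sketched in a few lines of base R. The fold-change values below are made up, and the linear rescaling is an assumption chosen to give a 0 to 1 range; the exact formula used inside celaref may differ.

```r
# Hypothetical lower 95% CI of log2 fold-change for five genes in one cluster.
lower_ci_fc <- c(GeneA = 2.1, GeneB = -0.4, GeneC = 0.9,
                 GeneD = 3.5, GeneE = -1.8)

# Rank genes so that the most enriched gene (largest lower CI) gets rank 1.
gene_rank <- rank(-lower_ci_fc, ties.method = "first")

# Rescale ranks to [0, 1]: 0 = most enriched, 1 = most absent.
# Assumption: simple linear rescaling, for illustration only.
rescaled_rank <- (gene_rank - 1) / (length(gene_rank) - 1)

# Genes from most to least enriched, with their rescaled ranks.
sort(rescaled_rank)
```

Ranking on the lower bound of the confidence interval, rather than the point estimate, favours genes whose enrichment is both large and well supported.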
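A minimal base R sketch of this selection and lookup is shown below. The data frame, its column names and all values are hypothetical, and requiring a positive lower CI is an illustrative stand-in for "enriched" rather than the package's exact criterion.

```r
# Hypothetical per-gene differential expression results for one query cluster.
de_query <- data.frame(
  gene        = paste0("Gene", 1:6),
  pval        = c(1e-8, 2e-5, 0.03, 1e-4, 0.5, 1e-6),
  lower_ci_fc = c(2.5, 1.2, 0.8, -1.5, 0.1, 1.9),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg correction across all tested genes.
de_query$fdr <- p.adjust(de_query$pval, method = "BH")

# 'Up' genes: significant (adjusted p < 0.01) and enriched (assumed here to
# mean a positive lower CI), ranked by lower CI and capped at the top n_max.
n_max <- 100
up <- de_query[de_query$fdr < 0.01 & de_query$lower_ci_fc > 0, ]
up <- head(up[order(-up$lower_ci_fc), ], n_max)

# Hypothetical rescaled ranks of every gene in one reference cluster.
ref_rescaled_rank <- c(Gene1 = 0.02, Gene2 = 0.10, Gene3 = 0.45,
                       Gene4 = 0.90, Gene5 = 0.60, Gene6 = 0.05)

# Look up the query 'Up' genes in the reference ranking.
up_ranks_in_ref <- ref_rescaled_rank[up$gene]
median_rank <- median(up_ranks_in_ref)
```

In this toy case the three selected genes all sit near the top of the reference ranking, so the median rank is close to 0, which is the pattern expected for matching cell types.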
Output plots are described here.
Typically, every cell cluster in the query data (each box) is plotted against everything in the reference data (X-axis). Each of the ‘up’ genes is represented by a tick mark, and the median gene rank is shown as a thick bar. A biased distribution near the top (i.e. a rescaled rank near 0) indicates similarity of the groups: essentially the same genes are representative of the clusters within their respective samples.
A median gene rank of 0.5 would indicate a completely random distribution. However, much lower values are common. The reciprocal nature of the within-dataset differential expression can cause this - what’s up in one cluster is down in another.
A small or heterogeneous cell group will not have much statistical power to select many ‘top’ genes (few tick marks) and these distributions will not be particularly informative. If there are no ‘top’ genes it won’t be plotted at all.
Because ‘top’ genes are compared to total reference rankings, the comparison between two datasets is not symmetrical. In ambiguous cases, it might be helpful to plot the reverse comparison from reference to query. For instance, if a query cluster happens to be a mix of two reference cell groups, a reciprocal plot may make this more obvious. Note that these reciprocal comparisons are considered in Assigning labels to clusters.
Lastly, there is a function to suggest some semi-sensible query cluster labels.
The first 4 columns of output (below) are the most interesting; the rest are described at the bottom of this section. The suggested cluster label is in the shortlab column, e.g.
test_group: cluster_5
shortlab: cluster_5:pyramidal SS|interneurons
pval: 3.49e-10
stepped_pvals: pyramidal SS:0.362,interneurons:3.49e-10,endothelial-mural:0.09,astrocytes_ependymal:0.0449,microglia:7.68e-19,oligodendrocytes:NA
There can be none, one or multiple reference group similarities for the query group. Multiple similarities are expected when there are similar cell sub-populations in the reference data. This can be visualised through the relative shapes of the top gene distributions for the reference groups. Reference group similarity labels are calculated as follows:
These labels are based on the distributions of the ranks of the query cluster’s ‘top’ genes in each of the reference groups (as plotted in the violin plots), rescaled to be in the 0-1 range.
1. The median gene rank for the ‘top’ genes in each reference group is calculated.
2. Reference groups are ordered from most to least similar (ascending median rank).
3. Mann-Whitney U tests are calculated between adjacent reference groups, i.e. 1st-2nd most similar, 2nd-3rd, 3rd-4th etc. These are the stepped_pvals reported above; the last value will always be undefined (NA). Essentially this tests whether the ‘top’ genes representative of the query group are significantly lower ranked (more similar) in one reference group than in the next most similar reference group. A genuine similarity of cell types should result in an abrupt change in these gene rank distributions.
4. Initial calls are made on which reference groups to include in the group label.
5. The group assignment from step 4 is checked to ensure that the (median of the) gene ranks is significantly above a random distribution, i.e. above the 0.5 halfway point in the violin plots.
6. Reciprocal-only matches are added to cluster labels in brackets.
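The first three steps of this procedure (median ranks, ordering, and stepped Mann-Whitney U tests between adjacent groups) can be sketched in base R as follows. The rank values and group names are invented, and this is only an outline of the idea, not the package's implementation; the later label-calling heuristics are not reproduced here.

```r
# Hypothetical rescaled ranks of one query cluster's 'top' genes, looked up
# in three reference groups (values invented for illustration).
top_ranks <- list(
  microglia  = seq(0.40, 0.99, length.out = 30),    # mostly low in the ranking
  neurons    = seq(0.01, 0.30, length.out = 30),    # concentrated near 0
  astrocytes = seq(0.055, 0.955, length.out = 30)   # spread evenly
)

# Median rank per reference group, ordered from most to least similar
# (ascending median rank).
median_ranks <- sort(vapply(top_ranks, median, numeric(1)))
ordered_groups <- names(median_ranks)

# One-sided Mann-Whitney U test between each group and the next most
# similar one. The last group has no successor, so its value is NA.
stepped_pvals <- vapply(seq_along(ordered_groups), function(i) {
  if (i == length(ordered_groups)) return(NA_real_)
  wilcox.test(top_ranks[[ordered_groups[i]]],
              top_ranks[[ordered_groups[i + 1]]],
              alternative = "less")$p.value
}, numeric(1))
names(stepped_pvals) <- ordered_groups
```

A very small p-value at the first step, followed by larger ones, corresponds to the "abrupt change" in rank distributions that suggests a single genuine match.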
The full version of this table is:
test_group: hybrid
shortlab: hybrid:No similarity
stepped_pvals: astrocytes_ependymal:0.41,microglia:0.634,oligodendrocytes:0.305,endothelial-mural:0.512,interneurons:0.204,pyramidal SS:NA

test_group: neurons
shortlab: neurons:pyramidal SS|interneurons
pval: 3.49e-10
stepped_pvals: pyramidal SS:0.362,interneurons:3.49e-10,endothelial-mural:0.09,astrocytes_ependymal:0.0449,microglia:7.68e-19,oligodendrocytes:NA
pval_to_random: 2.19e-12
matches: pyramidal SS|interneurons
reciprocal_matches: interneurons|pyramidal SS
The next few columns of the output describe some of the heuristics used in the cluster labelling.
pval_to_random : P-value of test of median rank (of last matched reference group) < random, from binomial test on top gene ranks (<0.5). If this isn’t significant, ‘No similarity’ will be reported. A completely random distribution would have a median rank in the middle of the violin plots, at 0.5.
matches : List of all reference groups that ‘match’, as described, except it also includes (rare) examples where pval_to_random is not significant. “|” separated, in descending order of match.
reciprocal_matches : List of all reference groups that flagged the test group as a match when the direction of comparison is reversed (significant pval and pval_to_random). “|” separated, in descending order of match.
The last 3 columns of the output are usually empty. When defined they may indicate borderline labelling or edge cases - checking the violin plots is advised! Tests are again Mann-Whitney U, but on non-adjacently ranked groups.
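To make the pval_to_random idea concrete, the sketch below runs a one-sided binomial test on whether more ‘top’ gene ranks fall below 0.5 than expected by chance. The rank values are invented and the exact test set-up inside celaref may differ; this only illustrates the principle described above.

```r
# Hypothetical rescaled ranks of a query cluster's 'top' genes in its
# best-matching reference group.
top_gene_ranks <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.08,
                    0.10, 0.15, 0.22, 0.30, 0.41, 0.62)

# Under a random distribution, about half of the ranks fall below 0.5.
n_below <- sum(top_gene_ranks < 0.5)

# One-sided binomial test: are more genes in the top half of the reference
# ranking than expected by chance?
pval_to_random <- binom.test(n_below, length(top_gene_ranks),
                             p = 0.5, alternative = "greater")$p.value
```

Here 11 of the 12 genes rank in the top half, which gives a p-value of 13/4096 (about 0.003), so this toy match would pass a 0.01 threshold.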
The Bioconductor landing page with information about this package is at https://bioconductor.org/packages/celaref
To install from Bioconductor via BiocManager:
# Installing BiocManager if necessary:
# install.packages("BiocManager")
BiocManager::install("celaref")
Or, to use the dev version from GitHub:
devtools::install_github("MonashBioinformaticsPlatform/celaref")
# Or
BiocManager::install("MonashBioinformaticsPlatform/celaref")
Suppose there’s a new scRNAseq dataset (the query), whose cells have already been clustered into 4 groups : Groups 1-4. But we don’t know which group corresponds to which cell type yet.
Luckily, there’s an older dataset (the reference) of the same tissue type in which someone else has already determined the cell types. They very helpfully named them ‘Weird subtype’, ‘Exciting’, ‘Mystery cell type’ and ‘Dunno’.
This example uses the reference dataset to flag likely cell types in the new experiment.
It uses a tiny simulated dataset of 200 genes (generated with splatter (Zappia, Phipson, and Oshlack 2017)) that is included in the package. The code can be copy-pasted and will complete fairly quickly.
library(celaref)

# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# Load data
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

# Filter data
toy_ref_se   <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se <- trim_small_groups_and_low_expression_genes(toy_query_se)

# Setup within-experiment differential expression
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se, dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se, dataset_name="query")
# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)
#> Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun` argument instead.
#> ℹ The deprecated feature was likely used in the celaref package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `fun.ymin` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun.min` argument instead.
#> Warning: The `fun.ymax` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun.max` argument instead.