scDiagnostics
scDiagnostics 1.1.0
Annotation transfer from a reference dataset is a key process for annotating cell types in new single-cell RNA-sequencing (scRNA-seq) experiments. This approach provides a quick, automated, and reproducible alternative to manual annotation based on marker gene expression. Despite its advantages, challenges such as dataset imbalance and unrecognized discrepancies between query and reference datasets can lead to inaccurate annotations and affect subsequent analyses.
The scDiagnostics package is designed to address these issues by offering a suite of diagnostic tools for the systematic evaluation of cell type assignments in scRNA-seq data. It provides functionality to assess whether query and reference datasets are well aligned, which is crucial for accurate annotation transfer. In addition, scDiagnostics helps evaluate annotation ambiguity, cluster heterogeneity, and marker gene alignment. These insights enable researchers to determine how precisely cells from a new scRNA-seq experiment can be assigned to known cell types, supporting more accurate and reliable downstream analysis.
To install the stable release version of the scDiagnostics package, follow the installation instructions on the package's Bioconductor page. This is the recommended way of installing the package.
To install the development version of the package from GitHub, use the following command:
BiocManager::install("ccb-hms/scDiagnostics")
To build the package vignettes upon installation use:
BiocManager::install("ccb-hms/scDiagnostics",
                     build_vignettes = TRUE,
                     dependencies = TRUE)
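Before running the install commands above, it can help to confirm that the installer packages are present. The following is a minimal sketch that only reports what is missing and installs nothing. It assumes that BiocManager relies on the remotes package for GitHub sources; `needed`, `is_available`, and `missing_pkgs` are illustrative names, not part of scDiagnostics.

```r
# Packages assumed to be needed for a GitHub install via BiocManager
needed <- c("BiocManager", "remotes")

# requireNamespace() returns FALSE (without an error) when a package is absent
is_available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_available]

if (length(missing_pkgs) > 0) {
    message("Install first: install.packages(c(",
            paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
    message("BiocManager and remotes are available.")
}
```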
Once you have installed the package, you can load it with the following code:
library(scDiagnostics)
To explore the full capabilities of the scDiagnostics package, you can use your own data or the datasets included in the package. This guide focuses on the built-in datasets, which provide a practical and convenient resource for demonstrating the features of scDiagnostics. They are designed to facilitate exploration of the package functionality and to help evaluate the accuracy of cell type assignments. You can learn more about them in the dataset documentation in the reference manual.
In the datasets available in the scDiagnostics package, reference_data, query_data, and qc_data are all SingleCellExperiment objects that include a logcounts assay, which stores the log-transformed expression values for the genes.
The reference_data and query_data objects both originate from scRNA-seq experiments on hematopoietic tissues, specifically bone marrow samples, as provided by the scRNAseq package. These datasets have undergone comprehensive processing and cleaning to ensure high-quality data for downstream analysis. Log-normalized counts were added to both datasets using the scuttle package. The query_data object has been further annotated with cell type assignments using the SingleR package, and it includes annotation_scores that reflect the confidence in these annotations. Additionally, gene set scores were computed with the AUCell package and incorporated into query_data. For feature selection, the top 500 highly variable genes (HVGs) common to both datasets were identified and retained using the scran package. Finally, dimensionality reduction techniques including PCA, t-SNE, and UMAP were applied to both datasets using the scater package, with the results stored within each object.
The qc_data dataset in this package is derived from the hpca dataset available in the celldex package. Like the other datasets, qc_data has undergone significant cleaning and processing to ensure high data quality. Quality control (QC) metrics were added using the scuttle package. Cell type annotations and the associated annotation_scores were generated using the SingleR package. Additionally, the top highly variable genes were selected using the scran package to enhance the dataset's utility for downstream analyses.
# Load datasets
data("reference_data")
data("query_data")
data("qc_data")
# Set seed for reproducibility
set.seed(0)
The reference_data object contains a column data field labeled expert_annotation, which provides cell type annotations assigned by experts. The query_data object also includes expert_annotation, and additionally features SingleR_annotation, which is the cell type annotation generated by the SingleR package, a popular package for cell type assignment based on reference datasets. The qc_data object contains a special column called annotation_scores, which holds the scores from the SingleR annotations, providing a measure of confidence or relevance for the assigned cell types.
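The annotation columns described above can be inspected directly once the datasets are loaded. The sketch below assumes that scDiagnostics and SingleCellExperiment are installed and uses only the object and column names mentioned in this vignette; the printed contents depend on the installed package version.

```r
library(scDiagnostics)
library(SingleCellExperiment)

data("reference_data")
data("query_data")
data("qc_data")

# Assays stored in the reference object (expected to include "logcounts")
assayNames(reference_data)

# Cell-level annotation columns in each object
colnames(colData(reference_data))
colnames(colData(query_data))

# Number of cells per expert-assigned cell type in the reference
table(reference_data$expert_annotation)

# Cross-tabulate expert labels against SingleR labels in the query
table(expert = query_data$expert_annotation,
      SingleR = query_data$SingleR_annotation)

# Summary of the SingleR scores stored in the QC dataset
summary(qc_data$annotation_scores)
```

Large off-diagonal counts in the cross-tabulation point to cell types where the automated and expert annotations disagree, which is where the diagnostics in this package are most informative.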
By working with these datasets, you can gain hands-on experience with the diagnostic tools and functions offered by scDiagnostics, and better understand how they assess the alignment of query and reference datasets, annotation ambiguity, cluster heterogeneity, and marker gene alignment.
Some functions in this vignette are designed to work with SingleCellExperiment objects that contain data from only one cell type. We will create separate SingleCellExperiment objects that contain only CD4 cells to ensure compatibility with these functions.
# Load libraries
library(scran)
library(scater)
# Subset to CD4 cells
ref_data_cd4 <- reference_data[, which(
    reference_data$expert_annotation == "CD4")]
query_data_cd4 <- query_data[, which(
    query_data$expert_annotation == "CD4")]
# Select highly variable genes
ref_top_genes <- getTopHVGs(ref_data_cd4, n = 500)
query_top_genes <- getTopHVGs(query_data_cd4, n = 500)
common_genes <- intersect(ref_top_genes, query_top_genes)
# Subset data by common genes
ref_data_cd4 <- ref_data_cd4[common_genes,]
query_data_cd4 <- query_data_cd4[common_genes,]
# Run PCA on both datasets
ref_data_cd4 <- runPCA(ref_data_cd4)
query_data_cd4 <- runPCA(query_data_cd4)
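Two base R behaviors make the subsetting steps above work as intended: which() drops NA labels rather than propagating them, and intersect() returns genes in the order of its first argument, so both objects end up with identical row order. The toy sketch below illustrates this on a plain matrix; all values and names (`expr`, `labels`, `ref_hvgs`, `query_hvgs`) are hypothetical and not taken from the package datasets.

```r
# Toy expression matrix: 4 genes x 5 cells (hypothetical values)
expr <- matrix(1:20, nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
labels <- c("CD4", "CD8", NA, "CD4", "Myeloid")

# which() keeps only positions where the comparison is TRUE, dropping NA
cd4_idx <- which(labels == "CD4")
expr_cd4 <- expr[, cd4_idx, drop = FALSE]

# intersect() keeps the order of its first argument
ref_hvgs <- c("gene3", "gene1", "gene4")
query_hvgs <- c("gene1", "gene2", "gene3")
common <- intersect(ref_hvgs, query_hvgs)

# Subsetting by name gives the same gene order in every object
expr_common <- expr_cd4[common, , drop = FALSE]
expr_common
```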
scDiagnostics
The functions introduced in this section represent just a subset of the functions available in the scDiagnostics package. For a complete overview and detailed demonstrations of all the functions included in the package, please refer to the designated vignettes, which you may browse from the pkgdown site for scDiagnostics. Each vignette is designed to address specific aspects of scDiagnostics, and this vignette highlights key functionalities to illustrate their applications. These vignettes provide in-depth guidance and examples for each function, helping users fully leverage the capabilities of scDiagnostics in their single-cell analyses.
For a detailed example of all possible functions to visualize reference and query datasets, please refer to the Visualization of Cell Type Annotations vignette.
plotCellTypePCA()
The plotCellTypePCA() function provides a visual comparison of principal components (PCs) for different cell types across query and reference datasets. By projecting the query data onto the PCA space of the reference dataset, it creates informative plots to help you understand how various cell types are distributed in the principal component space.
# Plot PCA data
pc_plot <- plotCellTypePCA(
    query_data = query_data,
    reference_data = reference_data,
    cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
    query_cell_type_col = "expert_annotation",
    ref_cell_type_col = "expert_annotation",
    pc_subset = 1:3
)
# Display plot
pc_plot
The function returns a ggplot object featuring pairwise scatter plots of the selected principal components. Each plot compares how different cell types from the query and reference datasets project onto the PCA space. This visualization aids in identifying how cell types distribute across PCs and facilitates comparisons between datasets.
The reference_data argument contains the reference cell data, which serves as the foundation for defining the PC space. The query_data argument contains the query cell data that will be projected. The function uses ref_cell_type_col and query_cell_type_col to identify the relevant cell type annotations in the reference and query datasets.
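The idea of projecting query cells onto a reference-defined PC space can be sketched in base R. This is an illustration of the general technique on simulated matrices, not the internal implementation of plotCellTypePCA(); the matrices `ref_mat` and `query_mat` are hypothetical.

```r
set.seed(1)

# Toy matrices: rows are cells, columns are genes (hypothetical data)
genes <- paste0("gene", 1:10)
ref_mat <- matrix(rnorm(200 * 10), nrow = 200, dimnames = list(NULL, genes))
query_mat <- matrix(rnorm(50 * 10), nrow = 50, dimnames = list(NULL, genes))

# Fit PCA on the reference cells only
ref_pca <- prcomp(ref_mat, center = TRUE, scale. = FALSE)

# Project query cells using the reference centering and rotation
query_proj <- predict(ref_pca, newdata = query_mat)

# Equivalent manual computation: subtract reference gene means, then rotate
manual_proj <- sweep(query_mat, 2, ref_pca$center) %*% ref_pca$rotation

# Query coordinates on the first three reference PCs
head(query_proj[, 1:3])
```

Because the query is centered with the reference gene means rather than its own, systematic shifts between the two datasets remain visible in the projected coordinates instead of being removed.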
calculateDiscriminantSpace()
Alternatively, you can use the calculateDiscriminantSpace() function, which projects query single-cell RNA-seq data onto a discriminant space defined by a reference dataset. This approach helps evaluate the similarity between the query and reference data, offering insights into the classification of query cells.
disc_output <- calculateDiscriminantSpace(
    reference_data = reference_data,
    query_data = query_data,
    ref_cell_type_col = "expert_annotation",
    query_cell_type_col = "SingleR_annotation"
)
The function returns the discriminant eigenvalues, which represent the variance explained by each discriminant axis, and the eigenvectors, which are used to project the data. It also provides the projections of the reference and query data onto the discriminant space. In addition, it reports the Mahalanobis distances between the query and reference cell types, which indicate how close the query projections are to the reference, and cosine similarity scores, which offer a second measure of similarity between the datasets.
plot(disc_output, plot_type = "scatterplot")
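The two similarity measures named above, Mahalanobis distance and cosine similarity, can be illustrated in base R. This is a minimal sketch on simulated two-dimensional projections; it does not reproduce how calculateDiscriminantSpace() computes or stores its results. The names `ref_proj`, `query_proj`, and `cosine_similarity` are hypothetical.

```r
set.seed(2)

# Hypothetical projections onto two discriminant axes (rows are cells)
ref_proj <- cbind(LD1 = rnorm(100, mean = 2), LD2 = rnorm(100, mean = 1))
query_proj <- cbind(LD1 = rnorm(20, mean = 2.5), LD2 = rnorm(20, mean = 1))

# Centroid and covariance of the reference cell type
ref_center <- colMeans(ref_proj)
ref_cov <- cov(ref_proj)

# mahalanobis() returns squared distances, so take the square root
maha_dist <- sqrt(mahalanobis(query_proj, center = ref_center, cov = ref_cov))

# Cosine of the angle between two vectors
cosine_similarity <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Similarity between the query centroid and the reference centroid
cos_sim <- cosine_similarity(colMeans(query_proj), ref_center)

summary(maha_dist)
cos_sim
```

The Mahalanobis distance accounts for the spread and correlation of the reference cells, while cosine similarity compares direction only, so the two measures can disagree when the query points in the same direction as the reference but lies far from its centroid.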