scDiagnostics 1.1.0
The scDiagnostics package provides tools for anomaly detection in single-cell data, enabling researchers to identify and analyze outliers in complex datasets. Central to this process is the detectAnomaly() function, which combines dimensionality reduction through Principal Component Analysis (PCA) with the isolation forest algorithm.
In single-cell analysis, detecting anomalies is crucial for identifying potential data issues, such as mislabeled cells, technical artifacts, or biologically distinct subpopulations. The detectAnomaly() function offers a versatile approach to anomaly detection by allowing users to project data onto a PCA space and apply isolation forests to uncover outliers. Whether working solely with a reference dataset or comparing a query dataset against a well-characterized reference, this function provides detailed insights into potential anomalies.
This vignette illustrates how to use the detectAnomaly() function in various scenarios. We explore both cell-type-specific and global anomaly detection, demonstrate the utility of integrating query data with reference data, and offer guidance on interpreting the results. Additionally, we show how to extend this analysis by combining anomaly detection with PCA loadings using the calculateCellSimilarityPCA() function, providing a comprehensive toolkit for investigating the structure and quality of single-cell data.
This vignette also demonstrates the use of two functions, calculateCellDistances() and calculateCellDistancesSimilarity(), to analyze the distances between cells in query and reference datasets and to measure the similarity of density distributions. These functions are useful for investigating how cells in different datasets relate to each other, particularly in the context of identifying anomalies and understanding distribution overlaps.
In the context of the scDiagnostics package, the following datasets illustrate the application of these tools:
reference_data: A curated and processed dataset containing expert-assigned cell type annotations. This dataset serves as a reference for comparison and can be used alone to detect anomalies within the reference annotations.
query_data: A dataset that also includes expert-assigned cell type annotations, but additionally features annotations generated by the SingleR package. This allows for the comparison of expert annotations with those produced by an automated method to detect inconsistencies or anomalies.
# Load library
library(scDiagnostics)
# Load datasets
data("reference_data")
data("query_data")
# Set seed for reproducibility
set.seed(0)
By using these datasets, you can compare annotations between reference_data and query_data, or analyze reference_data alone to identify potential issues. The package supports various analysis scenarios, whether you need to assess overall annotation quality or focus on specific cell types.
Through these capabilities, scDiagnostics supports thorough assessments of cell type annotations, which can improve the accuracy and reliability of your analyses.
detectAnomaly() Function
The detectAnomaly() function integrates dimensionality reduction via PCA with the isolation forest algorithm to detect anomalies in single-cell data. Both the reference dataset and the query dataset (if available) are projected onto the PCA space of the reference data. The function then trains an isolation forest model on the reference data in that space and uses it to pinpoint anomalies in the reference or query data. This approach is versatile: it can be applied to a reference dataset alone or to a query dataset projected onto the reference.
The function takes a SingleCellExperiment object as reference_data and trains an isolation forest model on the reference PCA-projected data, with an optional query_data for projecting onto this PCA space for anomaly detection. You can specify cell type annotations through ref_cell_type_col and query_cell_type_col, and limit the analysis to certain cell types using the cell_types parameter. The function allows you to select specific principal components to use to train the isolation forest via pc_subset, adjust the number of trees with n_tree, and set an anomaly_threshold for classifying anomalies.
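The idea described above (PCA on the reference, projection of the query, an isolation forest trained on the reference PCs, then a score cutoff) can be sketched outside the package. This is a minimal illustration, not the package's internal code: it assumes the isotree package for the isolation forest, uses synthetic random matrices in place of real expression data, and all object names (ref_expr, query_expr, anomaly_cutoff, and so on) are hypothetical.

```r
library(isotree)  # assumption: one available isolation forest implementation

set.seed(0)

# Synthetic stand-ins for expression matrices (cells x genes); not real data
n_genes <- 50
gene_names <- paste0("gene", seq_len(n_genes))
ref_expr <- matrix(rnorm(300 * n_genes), nrow = 300,
                   dimnames = list(NULL, gene_names))
query_expr <- matrix(rnorm(100 * n_genes), nrow = 100,
                     dimnames = list(NULL, gene_names))
# Shift a few query cells away from the bulk of the data
query_expr[1:5, ] <- query_expr[1:5, ] + 6

# 1. PCA is fitted on the reference only
ref_pca <- prcomp(ref_expr, center = TRUE, scale. = FALSE)

# 2. Keep a subset of PCs and project the query onto the reference PCA space
pc_subset <- 1:5
ref_pcs <- as.data.frame(ref_pca$x[, pc_subset, drop = FALSE])
query_pcs <- as.data.frame(
    predict(ref_pca, newdata = query_expr)[, pc_subset, drop = FALSE])

# 3. Train the isolation forest on the reference PCs
iso_model <- isolation.forest(ref_pcs, ntrees = 500, nthreads = 1)

# 4. Score reference and query cells with the same model
ref_scores <- predict(iso_model, ref_pcs, type = "score")
query_scores <- predict(iso_model, query_pcs, type = "score")

# 5. Flag cells whose score exceeds a chosen cutoff
anomaly_cutoff <- 0.6
query_anomaly <- query_scores > anomaly_cutoff
table(query_anomaly)
```

Fitting the PCA and the forest on the reference alone means that query cells are judged against the structure of the reference, which is the comparison the function is described as making.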
The function returns several outputs: anomaly_scores indicating the likelihood of each cell being an anomaly, a logical vector (anomaly) identifying these anomalies, PCA projections for the reference data (reference_mat_subset) and optionally for the query data (query_mat_subset), and the proportion of variance explained by the selected principal components (var_explained).
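As a minimal sketch of the reference-only case, the call below omits query_data and then inspects the returned object with base R's names() and str() instead of assuming its exact layout. It uses only arguments named above and leaves everything else at the package defaults; the object name ref_only_output is hypothetical.

```r
library(scDiagnostics)

# Load the example reference dataset shipped with the package
data("reference_data")
set.seed(0)

# Reference-only anomaly detection: no query_data supplied
ref_only_output <- detectAnomaly(reference_data = reference_data,
                                 ref_cell_type_col = "expert_annotation",
                                 pc_subset = 1:5,
                                 n_tree = 500)

# Inspect the top-level elements and their contents before relying on them
names(ref_only_output)
str(ref_only_output, max.level = 2)
```

Checking the structure first is a cautious way to locate the scores, anomaly flags, and PCA projections before using them in downstream code.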
detectAnomaly() Examples
This section demonstrates how to use the detectAnomaly() function when both reference and query datasets are provided. It includes examples of analyzing anomalies for specific cell types and globally across all data.
In this example, we analyze anomalies specifically for the “CD4” cell type. The isolation forest is trained on the PCA projections of the “CD4” cells from the reference dataset. Because query data is provided, anomaly scores for the query cells are predicted from their projections onto the reference PCA space for the “CD4” cell type.
# Perform anomaly detection
anomaly_output <- detectAnomaly(reference_data = reference_data,
                                query_data = query_data,
                                ref_cell_type_col = "expert_annotation",
                                query_cell_type_col = "SingleR_annotation",
                                pc_subset = 1:5,
                                n_tree = 500,
                                anomaly_treshold = 0.6)
# Plot the output for the "CD4" cell type
plot(anomaly_output,
     cell_type = "CD4",
     pc_subset = 1:5,
     data_type = "query")
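When only one cell type is of interest, the cell_types parameter described earlier can restrict the analysis to it. The sketch below assumes that passing the single label "CD4" is accepted, and it leaves the anomaly cutoff at the package default; the object name cd4_output is hypothetical.

```r
library(scDiagnostics)

# Load the example datasets shipped with the package
data("reference_data")
data("query_data")
set.seed(0)

# Restrict anomaly detection to the "CD4" cell type
cd4_output <- detectAnomaly(reference_data = reference_data,
                            query_data = query_data,
                            ref_cell_type_col = "expert_annotation",
                            query_cell_type_col = "SingleR_annotation",
                            cell_types = "CD4",
                            pc_subset = 1:5,
                            n_tree = 500)

# Plot the query cells annotated as "CD4"
plot(cd4_output,
     cell_type = "CD4",
     pc_subset = 1:5,
     data_type = "query")
```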
Here, we examine global anomaly detection by setting cell_type = NULL when plotting. In this case, the isolation forest is trained on PCA projections of all cells in the reference data combined. The global anomaly scores are then computed for both reference and query datasets.
# Perform anomaly detection
anomaly_output <- detectAnomaly(reference_data = reference_data,
                                query_data = query_data,
                                ref_cell_type_col = "expert_annotation",
                                query_cell_type_col = "SingleR_annotation",
                                pc_subset = 1:5,
                                n_tree = 500,
                                anomaly_treshold = 0.6)
# Plot the global anomaly scores
plot(anomaly_output,
     cell_type = NULL, # Plot all cell types
     pc_subset = 1:5,
     data_type = "query")