CDI 1.4.0
This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.
CDI can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CDI")
The latest version can be installed from Github:
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
remotes::install_github("jichunxie/CDI", build_vignettes = TRUE)
CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.
Inputs:
data(one_batch_matrix, package = "CDI")
dim(one_batch_matrix)
## [1] 3000 2000
data(one_batch_matrix_celltype, package = "CDI")
table(one_batch_matrix_celltype)
## one_batch_matrix_celltype
## type1 type2 type3 type4 type5
## 400 400 400 400 400
Here we provide 12 label sets generated from PCA\(+\)KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.
For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.
data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))
KMeans_k2 | KMeans_k4 | Seurat_k2 | Seurat_k3 |
---|---|---|---|
2 | 3 | 1 | 2 |
2 | 3 | 1 | 2 |
2 | 3 | 1 | 2 |
feature_gene_selection
Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.
feature_gene_indx <- feature_gene_selection(
X = one_batch_matrix,
batch_label = NULL,
method = "wds",
nfeature = 500)
sub_one_batch_matrix <- one_batch_matrix[feature_gene_indx,]
calculate_CDI
First, calculate the size factor of each gene with size_factor
. The output of size_factor
will be one of the inputs of calculate_CDI
.
one_batch_matrix_size_factor <- size_factor(X = one_batch_matrix)
Second, calculate CDI for each candidate set of cell labels:
start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
X = sub_one_batch_matrix,
cand_lab_df = one_batch_matrix_label_df,
batch_label = NULL,
cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 55.42861 secs
knitr::kable(CDI_return1)
Label_name | Cluster_method | CDI_AIC | CDI_BIC | neg_llk_val | N_cluster | |
---|---|---|---|---|---|---|
KMeans_k2 | KMeans_k2 | KMeans | 1135287 | 1146489 | 565643.4 | 2 |
KMeans_k3 | KMeans_k3 | KMeans | 1120096 | 1136899 | 557047.9 | 3 |
KMeans_k4 | KMeans_k4 | KMeans | 1112561 | 1134964 | 552280.4 | 4 |
KMeans_k5 | KMeans_k5 | KMeans | 1101970 | 1129974 | 545984.9 | 5 |
KMeans_k6 | KMeans_k6 | KMeans | 1103150 | 1136756 | 545575.1 | 6 |
KMeans_k7 | KMeans_k7 | KMeans | 1105022 | 1144229 | 545511.2 | 7 |
Seurat_k2 | Seurat_k2 | Seurat | 1129063 | 1140264 | 562531.3 | 2 |
Seurat_k3 | Seurat_k3 | Seurat | 1120002 | 1136805 | 557001.2 | 3 |
Seurat_k4 | Seurat_k4 | Seurat | 1109115 | 1131518 | 550557.3 | 4 |
Seurat_k5 | Seurat_k5 | Seurat | 1102321 | 1130325 | 546160.4 | 5 |
Seurat_k6 | Seurat_k6 | Seurat | 1103152 | 1136751 | 545575.9 | 6 |
Seurat_k7 | Seurat_k7 | Seurat | 1104099 | 1143292 | 545049.6 | 7 |
CDI counts the number of unique characters in each label set as the “N_cluster”. If the column names of label set data frame are provided with the format "[ClusteringMethod]_k[NumberOfClusters]" such as “KMeans_K5, calculate_CDI
will extract the”[ClusteringMethod]" as the Cluster_method. The clustering method can also be provided in the argument clustering_method
for each label set.
The outputs of calculate_CDI
include CDI-AIC and CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).
Output visualization
The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.
The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.
CDI_lineplot(cdi_dataframe = CDI_return1, cdi_type = "CDI_BIC")