1 Introduction

This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.

2 Installation

CDI can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CDI")

The latest version can be installed from Github:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("jichunxie/CDI", build_vignettes = TRUE) 

3 Load datasets

CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.

Inputs:

  • one_batch_matrix: a single-cell RNA-seq UMI count matrix from one batch (each row represents a gene and each column represent a cell). To apply CDI to datasets from multiple batches, please check section 4.
  • one_batch_matrix_label_df: the label sets of the one_batch_matrix, where each row represents a cell, and each column represents a set of cell labels generated by existing clustering methods such as K-Means.
  • one_batch_matrix_celltype: a vector of characters indicating the benchmark cell types of cells in one_batch_matrix (not necessary except for evaluation purpose).
data(one_batch_matrix, package = "CDI")
dim(one_batch_matrix)
## [1] 3000 2000
data(one_batch_matrix_celltype, package = "CDI")
table(one_batch_matrix_celltype)
## one_batch_matrix_celltype
## type1 type2 type3 type4 type5 
##   400   400   400   400   400

Here we provide 12 label sets generated from PCA\(+\)KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.

For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.

data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))
KMeans_k2 KMeans_k4 Seurat_k2 Seurat_k3
2 3 1 2
2 3 1 2
2 3 1 2

4 Select feature genes with feature_gene_selection

Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.

feature_gene_indx <- feature_gene_selection(
    X = one_batch_matrix, 
    batch_label = NULL, 
    method = "wds", 
    nfeature = 500)
sub_one_batch_matrix <- one_batch_matrix[feature_gene_indx,]

5 calculate_CDI

First, calculate the size factor of each gene with size_factor. The output of size_factor will be one of the inputs of calculate_CDI.

one_batch_matrix_size_factor <- size_factor(X = one_batch_matrix)

Second, calculate CDI for each candidate set of cell labels:

start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
    X = sub_one_batch_matrix, 
    cand_lab_df = one_batch_matrix_label_df, 
    batch_label = NULL, 
    cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 55.42861 secs

6 Select the optimal label set with minimum CDI-AIC/CDI-BIC

knitr::kable(CDI_return1)
Label_name Cluster_method CDI_AIC CDI_BIC neg_llk_val N_cluster
KMeans_k2 KMeans_k2 KMeans 1135287 1146489 565643.4 2
KMeans_k3 KMeans_k3 KMeans 1120096 1136899 557047.9 3
KMeans_k4 KMeans_k4 KMeans 1112561 1134964 552280.4 4
KMeans_k5 KMeans_k5 KMeans 1101970 1129974 545984.9 5
KMeans_k6 KMeans_k6 KMeans 1103150 1136756 545575.1 6
KMeans_k7 KMeans_k7 KMeans 1105022 1144229 545511.2 7
Seurat_k2 Seurat_k2 Seurat 1129063 1140264 562531.3 2
Seurat_k3 Seurat_k3 Seurat 1120002 1136805 557001.2 3
Seurat_k4 Seurat_k4 Seurat 1109115 1131518 550557.3 4
Seurat_k5 Seurat_k5 Seurat 1102321 1130325 546160.4 5
Seurat_k6 Seurat_k6 Seurat 1103152 1136751 545575.9 6
Seurat_k7 Seurat_k7 Seurat 1104099 1143292 545049.6 7

CDI counts the number of unique characters in each label set as the “N_cluster”. If the column names of label set data frame are provided with the format "[ClusteringMethod]_k[NumberOfClusters]" such as “KMeans_K5, calculate_CDI will extract the”[ClusteringMethod]" as the Cluster_method. The clustering method can also be provided in the argument clustering_method for each label set.

The outputs of calculate_CDI include CDI-AIC and CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).

Output visualization

The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.

The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.

CDI_lineplot(cdi_dataframe = CDI_return1, cdi_type = "CDI_BIC")