Gene Detection Analysis for scRNA-seq

Ruoxin Li, Gerald Quon

2023-10-24

Introduction

This tutorial provides an example analysis for modelling gene detection patterns, as outlined in R. Li et al., 2018. Its goal is to give an overview of cell type classification and visualization using a low-dimensional embedding learned by a class of gene detection models: BFA (Binary Factor Analysis) and Binary PCA.

Summary of workflow

The following workflow summarizes a typical dimensionality reduction procedure performed by BFA or Binary PCA.

  1. Data processing
  2. Dimensionality reduction
  3. Visualization

Installation

Let’s start with the installation:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scBFA")

Next, we load the required packages:

library(zinbwave)
library(SingleCellExperiment)
library(ggplot2)
library(scBFA)

Information of example dataset

The example dataset is derived from our scRNA-seq pre-DC/cDC dataset (GEO accession GSE89232, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89232). After performing the quality control procedures for genes and cells outlined in the paper, we select the 500 most variable genes for illustration purposes. The example dataset consists of 950 cells and 500 genes.


# raw count matrix: rows are genes, columns are cells
data("exprdata")

# vector of ground-truth cell type labels provided by the conquer database
data("celltype")
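The selection of the most variable genes mentioned above can be sketched in base R. This is only an illustration on simulated counts: the ranking criterion (variance of log-transformed counts) and the names `counts`, `gene_var`, and `counts_top` are assumptions, and the procedure used in the paper may differ.

```r
# Simulated counts (genes x cells) standing in for a QC-filtered matrix.
set.seed(1)
n_genes <- 2000
n_cells <- 100
gene_means <- rgamma(n_genes, shape = 0.5, rate = 0.5)
counts <- matrix(
  rpois(n_genes * n_cells, lambda = rep(gene_means, times = n_cells)),
  nrow = n_genes, ncol = n_cells,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)

# Assumption: rank genes by the variance of log-transformed counts.
log_counts <- log1p(counts)
gene_var <- apply(log_counts, 1, var)

# Keep the n_top genes with the largest variance.
n_top <- 500
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]
counts_top <- counts[top_genes, , drop = FALSE]

dim(counts_top)  # genes x cells after selection
```

The resulting matrix keeps the genes-by-cells orientation of `exprdata`, so it could be used in the same way in the steps that follow.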

Working with SingleCellExperiment class

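BFA and Binary PCA accept three kinds of input object (scData):

  1. a raw count matrix, in which rows are genes and columns are cells
  2. a Seurat object
  3. a SingleCellExperiment object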

For illustration, here we construct a SingleCellExperiment object to serve as the input to BFA and Binary PCA.

sce <- SingleCellExperiment(assays = list(counts = exprdata))
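As a quick sanity check on the object layout, the sketch below builds a small SingleCellExperiment from simulated counts and inspects it. The toy matrix, the labels, and the choice to store the labels in `colData` are illustrative assumptions; nothing here implies that scBFA reads `colData`.

```r
library(SingleCellExperiment)

# Toy counts (genes x cells) and hypothetical cell type labels.
set.seed(1)
toy_counts <- matrix(rpois(20 * 6, lambda = 1), nrow = 20, ncol = 6,
                     dimnames = list(paste0("gene", 1:20),
                                     paste0("cell", 1:6)))
toy_labels <- rep(c("typeA", "typeB"), each = 3)

# Store counts as the "counts" assay and keep labels next to the cells.
toy_sce <- SingleCellExperiment(
  assays  = list(counts = toy_counts),
  colData = DataFrame(celltype = toy_labels)
)

dim(toy_sce)                  # genes x cells
assayNames(toy_sce)           # names of the stored assays
counts(toy_sce)[1:3, 1:3]     # accessor for the "counts" assay
table(toy_sce$celltype)       # labels stored in colData
```

Keeping labels in `colData` is optional; in this tutorial the labels stay in the separate `celltype` vector.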

Gene Detection Model analysis

Binary Factor Analysis

Let \(N\) denote the number of cells, \(G\) the number of genes, and \(K\) the number of latent dimensions.

After fitting the gene detection matrix, a BFA model object contains the following parameters.

  1. \(Z\) is the \(N\) by \(K\) embedding matrix, named “ZZ” in the model object
  2. \(A\) is the \(G\) by \(K\) loading matrix, named “AA” in the model object
  3. \(\beta\): if there are cell-level covariates (e.g. batch effects), \(\beta\) is the corresponding coefficient matrix, named “beta” in the model object
  4. \(\gamma\): if there are gene-level covariates (e.g. QC measures), \(\gamma\) is the corresponding coefficient matrix, named “gamma” in the model object
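To make the shapes of these parameters concrete, here is a minimal base R sketch. It assumes that the gene detection matrix is the count matrix thresholded at zero and that detection probabilities follow a logistic link applied to \(Z A^T\), ignoring covariates and any offsets; the exact parameterization used by scBFA may differ, and all object names are hypothetical.

```r
# Gene detection: a gene is "detected" in a cell when its count is non-zero.
toy_counts <- matrix(c(0, 3, 1, 0, 0, 7), nrow = 2)  # 2 genes x 3 cells
detection <- (toy_counts > 0) * 1L

# Simulate from an assumed logistic factor model (no covariates).
set.seed(1)
N <- 50  # cells
G <- 30  # genes
K <- 2   # latent dimensions
Z <- matrix(rnorm(N * K), nrow = N, ncol = K)  # embedding, N x K
A <- matrix(rnorm(G * K), nrow = G, ncol = K)  # loadings,  G x K

logit_p  <- Z %*% t(A)        # linear predictor, N x G (cells x genes)
p_detect <- plogis(logit_p)   # detection probabilities in (0, 1)
B <- matrix(rbinom(N * G, size = 1, prob = p_detect), nrow = N, ncol = G)

dim(B)  # simulated detection matrix, cells x genes
```

Note that \(Z A^T\) is cells by genes, whereas the input count matrix is genes by cells.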

We choose 2 as the number of latent dimensions and project the gene detection matrix onto the embedding space.

bfa_model = scBFA(scData = sce, numFactors = 2) 

We then visualize the two-dimensional BFA embedding. Because numFactors = 2, the two columns of ZZ are plotted directly, without an additional t-SNE step. Points are colored by their corresponding cell types.

# fix the seed so that the jitter is reproducible
set.seed(5)
df = as.data.frame(bfa_model$ZZ)
df$celltype = celltype

p1 <- ggplot(df,aes(x = V1,y = V2,colour = celltype))
p1 <- p1 + geom_jitter(size=2.5,alpha = 0.8) 
colorvalue <- c("#43d5f9","#24b71f","#E41A1C", "#ffc935","#3d014c","#39ddb2",
                "slateblue2","maroon","#f7df27","palevioletred1","olivedrab3",
                "#377EB8","#5043c1","blue","aquamarine2","chartreuse4",
                "burlywood2","indianred1","mediumorchid1")
p1 <- p1 + xlab("BFA factor 1") + ylab("BFA factor 2") 
p1 <- p1 + scale_color_manual(values = colorvalue)
p1 <- p1 + theme(panel.background = element_blank(),
                  legend.position = "right",
                  axis.text=element_blank(),
                  axis.line.x = element_line(color="black"),
                  axis.line.y = element_line(color="black"),
                  plot.title = element_blank()
                   )
p1
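The plot above uses the two embedding columns as axes. With more than two latent dimensions, a further projection such as t-SNE would be needed before plotting. The following is a minimal sketch, assuming the Rtsne package is available and using a simulated matrix in place of a fitted embedding; `embedding`, `tsne_xy`, and `df_tsne` are hypothetical names, and the fallback branch exists only so that the snippet runs without Rtsne.

```r
# Simulated N x K embedding (K = 5) standing in for a fitted model's ZZ.
set.seed(5)
n_cells <- 120
embedding <- matrix(rnorm(n_cells * 5), nrow = n_cells, ncol = 5)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # Project the K-dimensional embedding to 2D; PCA pre-processing is
  # switched off because the input is already low dimensional.
  tsne_fit <- Rtsne::Rtsne(embedding, dims = 2, perplexity = 30,
                           pca = FALSE, check_duplicates = FALSE)
  tsne_xy <- tsne_fit$Y
} else {
  # Fallback when Rtsne is not installed: use the first two columns.
  tsne_xy <- embedding[, 1:2]
}

# Data frame with the same column names used in the plotting code.
df_tsne <- data.frame(V1 = tsne_xy[, 1], V2 = tsne_xy[, 2])
head(df_tsne)
```

A data frame built this way, with a `celltype` column added, could be passed to the same ggplot code.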

Binary PCA

bpca = BinaryPCA(scData = sce) 
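The idea behind Binary PCA can be sketched in base R as principal component analysis of the binarized count matrix. This is an approximation under stated assumptions: the centering and scaling choices, the toy data, and all object names are hypothetical, and the internals of `BinaryPCA` may differ.

```r
# Toy counts (genes x cells).
set.seed(1)
n_genes <- 40
n_cells <- 25
toy_counts <- matrix(rpois(n_genes * n_cells, lambda = 0.7),
                     nrow = n_genes, ncol = n_cells)

# Binarize: 1 if the gene is detected in the cell, 0 otherwise.
detection <- (toy_counts > 0) * 1

# Drop genes with constant detection; they carry no variance.
keep <- apply(detection, 1, var) > 0

# PCA with cells as observations (rows), so transpose the matrix.
pca_fit <- prcomp(t(detection[keep, , drop = FALSE]),
                  center = TRUE, scale. = FALSE)

scores <- pca_fit$x[, 1:2]                             # cells x 2
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)  # per component
round(var_explained[1:2], 3)
```

The `$x` element holds one row of scores per cell, which is the element the plotting code takes from `bpca`.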

We then visualize the first two principal components from Binary PCA, plotted directly without an additional t-SNE step. Points are colored by their corresponding cell types.

df = as.data.frame(bpca$x[,c(1:2)])
colnames(df) = c("V1","V2")
df$celltype = celltype

p1 <- ggplot(df,aes(x = V1,y = V2,colour = celltype))
p1 <- p1 + geom_jitter(size=2.5,alpha = 0.8) 
colorvalue <- c("#43d5f9","#24b71f","#E41A1C", "#ffc935","#3d014c","#39ddb2",
                "slateblue2","maroon","#f7df27","palevioletred1","olivedrab3",
                "#377EB8","#5043c1","blue","aquamarine2","chartreuse4",
                "burlywood2","indianred1","mediumorchid1")
p1 <- p1 + xlab("PC 1") + ylab("PC 2") 
p1 <- p1 + scale_color_manual(values = colorvalue)
p1 <- p1 + theme(panel.background = element_blank(),
                legend.position = "right",
                axis.text=element_blank(),
                axis.line.x = element_line(color="black"),
                axis.line.y = element_line(color="black"),
                plot.title = element_blank()
                )
p1

Session Info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scBFA_1.16.0                ggplot2_3.4.4              
#>  [3] zinbwave_1.24.0             SingleCellExperiment_1.24.0
#>  [5] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#>  [7] GenomicRanges_1.54.0        GenomeInfoDb_1.38.0        
#>  [9] IRanges_2.36.0              S4Vectors_0.40.0           
#> [11] BiocGenerics_0.48.0         MatrixGenerics_1.14.0      
#> [13] matrixStats_1.0.0          
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.21        splines_4.3.1           later_1.3.1            
#>   [4] bitops_1.0-7            tibble_3.2.1            polyclip_1.10-6        
#>   [7] XML_3.99-0.14           lifecycle_1.0.3         edgeR_4.0.0            
#>  [10] globals_0.16.2          lattice_0.22-5          MASS_7.3-60            
#>  [13] magrittr_2.0.3          limma_3.58.0            plotly_4.10.3          
#>  [16] sass_0.4.7              rmarkdown_2.25          jquerylib_0.1.4        
#>  [19] yaml_2.3.7              httpuv_1.6.12           Seurat_4.4.0           
#>  [22] sctransform_0.4.1       sp_2.1-1                spatstat.sparse_3.0-3  
#>  [25] reticulate_1.34.0       cowplot_1.1.1           pbapply_1.7-2          
#>  [28] DBI_1.1.3               RColorBrewer_1.1-3      ADGofTest_0.3          
#>  [31] abind_1.4-5             zlibbioc_1.48.0         Rtsne_0.16             
#>  [34] pspline_1.0-19          purrr_1.0.2             RCurl_1.98-1.12        
#>  [37] GenomeInfoDbData_1.2.11 ggrepel_0.9.4           irlba_2.3.5.1          
#>  [40] listenv_0.9.0           spatstat.utils_3.0-4    genefilter_1.84.0      
#>  [43] goftest_1.2-3           spatstat.random_3.2-1   annotate_1.80.0        
#>  [46] fitdistrplus_1.1-11     parallelly_1.36.0       leiden_0.4.3           
#>  [49] codetools_0.2-19        DelayedArray_0.28.0     tidyselect_1.2.0       
#>  [52] farver_2.1.1            spatstat.explore_3.2-5  jsonlite_1.8.7         
#>  [55] ellipsis_0.3.2          progressr_0.14.0        ggridges_0.5.4         
#>  [58] survival_3.5-7          tools_4.3.1             ica_1.0-3              
#>  [61] Rcpp_1.0.11             glue_1.6.2              gridExtra_2.3          
#>  [64] SparseArray_1.2.0       xfun_0.40               DESeq2_1.42.0          
#>  [67] dplyr_1.1.3             numDeriv_2016.8-1.1     withr_2.5.1            
#>  [70] fastmap_1.1.1           fansi_1.0.5             digest_0.6.33          
#>  [73] R6_2.5.1                mime_0.12               colorspace_2.1-0       
#>  [76] scattermore_1.2         tensor_1.5              spatstat.data_3.0-3    
#>  [79] RSQLite_2.3.1           copula_1.1-2            utf8_1.2.4             
#>  [82] tidyr_1.3.0             generics_0.1.3          data.table_1.14.8      
#>  [85] httr_1.4.7              htmlwidgets_1.6.2       S4Arrays_1.2.0         
#>  [88] uwot_0.1.16             pkgconfig_2.0.3         gtable_0.3.4           
#>  [91] blob_1.2.4              lmtest_0.9-40           XVector_0.42.0         
#>  [94] pcaPP_2.0-3             htmltools_0.5.6.1       SeuratObject_4.1.4     
#>  [97] scales_1.2.1            png_0.1-8               knitr_1.44             
#> [100] reshape2_1.4.4          nlme_3.1-163            cachem_1.0.8           
#> [103] zoo_1.8-12              stringr_1.5.0           KernSmooth_2.23-22     
#> [106] parallel_4.3.1          miniUI_0.1.1.1          softImpute_1.4-1       
#> [109] AnnotationDbi_1.64.0    pillar_1.9.0            grid_4.3.1             
#> [112] vctrs_0.6.4             RANN_2.6.1              promises_1.2.1         
#> [115] xtable_1.8-4            cluster_2.1.4           evaluate_0.22          
#> [118] mvtnorm_1.2-3           cli_3.6.1               locfit_1.5-9.8         
#> [121] compiler_4.3.1          rlang_1.1.1             crayon_1.5.2           
#> [124] future.apply_1.11.0     labeling_0.4.3          plyr_1.8.9             
#> [127] stringi_1.7.12          viridisLite_0.4.2       deldir_1.0-9           
#> [130] BiocParallel_1.36.0     munsell_0.5.0           Biostrings_2.70.0      
#> [133] gsl_2.1-8               lazyeval_0.2.2          spatstat.geom_3.2-7    
#> [136] Matrix_1.6-1.1          stabledist_0.7-1        patchwork_1.1.3        
#> [139] bit64_4.0.5             future_1.33.0           KEGGREST_1.42.0        
#> [142] statmod_1.5.0           shiny_1.7.5.1           ROCR_1.0-11            
#> [145] igraph_1.5.1            memoise_2.0.1           bslib_0.5.1            
#> [148] bit_4.0.5