CelliD is a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell RNA-seq. CelliD allows unbiased cell identity recognition across different donors, tissues-of-origin, model organisms and single-cell omics protocols. We present here the main functionalities of the CelliD R package through several use cases. This vignette illustrates the use of the CelliD R package.
CelliD is based on R version >= 4.1. It contains dependencies with several CRAN and Biocondutor packages as described in the Description file
To install CelliD, set the repositories option to enable downloading Bioconductor dependencies:
if(!require("tidyverse")) install.packages("tidyverse") if(!require("ggpubr")) install.packages("ggpubr") BiocManager::install("CelliD")
macOS users might experience installation issues related to Gfortran library. To solve this, download and install the appropriate gfortran dmg file from https://github.com/fxcoudert/gfortran-for-macOS
## Warning: replacing previous import 'data.table::shift' by 'tictoc::shift' when ## loading 'CelliD'
library(tidyverse) # general purpose library for data handling library(ggpubr) #library for plotting
To illustrate CelliD usage we will use throughout this vignette two publicly available pancreas single-cell RNA-seq data sets provided in Baron et al. 2016 and Segerstolpe et al. 2016. We provide here for convenience the R objects with the raw counts gene expression matrices and associated metadata
#read data BaronMatrix <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMatrix.rds")) BaronMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/BaronMetaData.rds")) SegerMatrix <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMatrix.rds")) SegerMetaData <- readRDS(url("https://storage.googleapis.com/cellid-cbl/SegerstolpeMetaData2.rds"))
While CelliD can handle all types of genes, we recommend restricting the analysis to protein-coding genes. CelliD package provides two gene lists (HgProteinCodingGenes and MgProteinCodingGenes) containing a background set of 19308 and 21914 protein-coding genes from human and mouse, respectively, obtained from BioMart Ensembl release 100, version April 2020 (GrCH38.p13 for human, and GRCm38.p6 for mouse). Gene identifiers here correspond to official gene symbols.
# Restricting to protein-coding genes data("HgProteinCodingGenes") BaronMatrixProt <- BaronMatrix[rownames(BaronMatrix) %in% HgProteinCodingGenes,] SegerMatrixProt <- SegerMatrix[rownames(SegerMatrix) %in% HgProteinCodingGenes,]
CelliD use as input single cell data in the form of specific S4objects. Currently supported files are SingleCellExperiment from Bioconductor and Seurat Version 3 from CRAN. For downstream analyses, gene identifiers corresponding to official gene symbols are used.
Library size normalization is carried out by rescaling counts to a common library size of 10000. Log transformation is performed after adding a pseudo-count of 1. Genes expressing at least one count in less than 5 cells are removed. These steps mimic the standard Seurat workflow described here.
# Create Seurat object and remove remove low detection genes Baron <- CreateSeuratObject(counts = BaronMatrixProt, project = "Baron", min.cells = 5, meta.data = BaronMetaData) Seger <- CreateSeuratObject(counts = SegerMatrixProt, project = "Segerstolpe", min.cells = 5, meta.data = SegerMetaData) # Library-size normalization, log-transformation, and centering and scaling of gene expression values Baron <- NormalizeData(Baron) Baron <- ScaleData(Baron, features = rownames(Baron))
While this vignette only illustrates the use of CelliD on single-cell RNA-seq data, we note that the package can handle other types of gene matrices, e.g sci-ATAC gene activity score matrices.
CelliD is based on Multiple Correspondence Analysis (MCA), a multivariate method that allows the simultaneous representation of both cells and genes in the same low dimensional vector space. In such space, euclidean distances between genes and cells are computed, and per-cell gene rankings are obtained. The top ‘n’ closest genes to a given cell will be defined as its gene signature. Per-cell gene signatures will be obtained in a later section of this vignette.
To perform MCA dimensionality reduction, the command
RunMCA is used:
Baron <- RunMCA(Baron)
## Computing Fuzzy Matrix
## 6.096 sec elapsed
## Computing SVD
## 135.873 sec elapsed
## Computing Coordinates
## 15.543 sec elapsed
DimPlotMC command allows to visualize both cells and selected gene lists in MCA low dimensional space.
DimPlotMC(Baron, reduction = "mca", group.by = "cell.type", features = c("CTRL", "INS", "MYZAP", "CDH11"), as.text = TRUE) + ggtitle("MCA with some key gene markers")
## Warning in geom_point(data = featureData, mapping = aes_string(x = DIM1, : ## Ignoring unknown aesthetics: text
## Warning in ggrepel::geom_text_repel(data = featureData, mapping = aes_string(x ## = DIM1, : Ignoring unknown aesthetics: text
In the previous plot we colored cells according to the pre-established cell type annotations available as metadata (“cell.type”) and as provided in Baron et al. 2016. No clustering step was performed here. In the common scenario where such annotations were not available, the
DimPlotMC function would represent cells as red colored dots, and genes as black crosses. To represent all genes in the previous plot, just remove the “features” parameter from the previous command, so that the
DimPlotMC function takes the default value.
For the sake of comparisson, state-of-the-art dimensionality reduction techniques such as PCA, UMAP and tSNE can be obtained as follows:
Baron <- RunPCA(Baron, features = rownames(Baron))
## PC_ 1 ## Positive: IFITM3, ZFP36L1, CDC42EP1, TMSB10, PMEPA1, YBX3, SOX4, IL32, TMSB4X, RHOC ## S100A11, TACSTD2, TM4SF1, KRT7, IFITM2, CD44, FLNA, MSN, S100A16, MYH9 ## SDC4, LGALS3, SERPING1, PRSS8, SERPINA3, ITGA2, COL18A1, GSTP1, SAT1, DDIT4 ## Negative: PCSK1N, CPE, SCG5, PTPRN, CHGA, CHGB, GNAS, SCGN, SCG2, CD99 ## PAM, PEMT, VGF, PPP1R1A, BEX1, PCSK2, KIAA1324, TTR, PAX6, IDS ## NLRP1, SLC30A8, EEF1A2, UCHL1, GAD2, APLP1, SYT7, NEUROD1, ERO1B, TMOD1 ## PC_ 2 ## Positive: CELA3A, PRSS1, CELA3B, CTRC, CTRB1, PRSS2, CLPS, CPA1, PNLIPRP1, PLA2G1B ## CPA2, CELA2A, CTRB2, CPB1, KLK1, SYCN, REG1A, PNLIP, REG1B, GP2 ## CUZD1, SPINK1, PRSS3, CTRL, CELA2B, FOXD2, REG3G, MGST1, PDIA2, AMY2A ## Negative: PSAP, IGFBP7, COL18A1, APP, CALR, ITGB1, PKM, LGALS3BP, CD81, ITM2B ## GNAI2, GRN, APLP2, CTSD, CDK2AP1, LAMP1, PXDN, SPARC, ATP1A1, MMP14 ## ITGA5, PLXNB2, TIMP1, FSTL1, COL4A1, CALD1, CTSB, NBL1, HSPG2, FLNA ## PC_ 3 ## Positive: SPARC, COL4A1, ENG, IGFBP4, COL6A2, PDGFRB, BGN, PXDN, COL15A1, COL3A1 ## COL1A2, CYGB, AEBP1, ADAMTS4, COL6A3, HTRA3, EMILIN1, LRRC32, CRISPLD2, COL1A1 ## COL5A3, C11orf96, LAMA4, MMP2, COL6A1, COL5A1, ITGA1, TIMP3, HIC1, COL5A2 ## Negative: KRT19, TACSTD2, KRT7, SPINT2, ELF3, KRT8, ANXA4, PRSS8, SERINC2, SPINT1 ## CDH1, LCN2, CLDN7, CLDN4, CFB, CD24, SERPINA5, CFTR, KRT18, CLDN1 ## MMP7, LGALS4, ST14, C3, SDC4, LAD1, S100A14, SLC4A4, PTPRF, ABCC3 ## PC_ 4 ## Positive: INS, HADH, IAPP, PCSK1, ADCYAP1, TIMP2, IGF2, DLK1, GSN, NPTX2 ## TSPAN13, PDX1, VAT1L, RPS25, WLS, RPL7, RPL24, MAFA, SLC6A6, SYT13 ## ELMO1, DNAJB9, UCHL1, RPL3, RGS16, RPS6, RBP4, PKIB, TGFBR3, RPL17 ## Negative: GCG, GC, TM4SF4, CLU, TMEM176B, IRX2, TTR, TMEM176A, FEV, F10 ## FXYD5, CRYBA2, SMIM24, SERPINA1, FXYD3, PCSK2, ARX, ALDH1A1, LOXL4, B2M ## PAPPA2, CD82, GPX3, FXYD6, GLS, MUC13, VSTM2L, EGFL7, RGS4, COTL1 ## PC_ 5 ## Positive: COL6A3, COL1A2, PDGFRB, COL1A1, BGN, COL3A1, CRLF1, COL5A1, SFRP2, CYGB ## THBS2, EMILIN1, COL6A1, FMOD, VCAN, COL6A2, PRRX1, COL5A2, COL5A3, ADAMTS12 ## CRISPLD2, LAMC3, AEBP1, NOTCH3, MXRA8, LUM, VSTM4, TPM2, INHBA, FN1 ## Negative: PLVAP, PECAM1, CD93, FLT1, KDR, SOX18, PODXL, ACVRL1, RGCC, VWF ## ROBO4, TIE1, ESM1, ADGRL4, ESAM, ECSCR, S1PR1, SEMA3F, CXCR4, CLDN5 ## PASK, NOTCH4, ERG, DYSF, CDH5, CYP1A1, ANGPT2, BCL6B, IFI27, PTPRB
Baron <- RunUMAP(Baron, dims = 1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric ## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' ## This message will be shown once per session
## 16:14:03 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:14:03 Read 8569 rows and found 30 numeric columns
## 16:14:03 Using Annoy for neighbor search, n_neighbors = 30
## 16:14:03 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## **************************************************| ## 16:14:04 Writing NN index file to temp file /tmp/RtmpiviH37/file1c99affd9a748 ## 16:14:04 Searching Annoy index using 1 thread, search_k = 3000 ## 16:14:07 Annoy recall = 100% ## 16:14:08 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 ## 16:14:10 Initializing from normalized Laplacian + noise (using RSpectra) ## 16:14:11 Commencing optimization for 500 epochs, with 351608 positive edges ## 16:14:23 Optimization finished
Baron <- RunTSNE(Baron, dims = 1:30) PCA <- DimPlot(Baron, reduction = "pca", group.by = "cell.type") + ggtitle("PCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1) tSNE <- DimPlot(Baron, reduction = "tsne", group.by = "cell.type")+ ggtitle("tSNE") + theme(legend.text = element_text(size =10), aspect.ratio = 1) UMAP <- DimPlot(Baron, reduction = "umap", group.by = "cell.type") + ggtitle("UMAP") + theme(legend.text = element_text(size =10), aspect.ratio = 1) MCA <- DimPlot(Baron, reduction = "mca", group.by = "cell.type") + ggtitle("MCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1) ggarrange(PCA, MCA, common.legend = TRUE, legend = "top")