Example data
Load the example dataset, a small and randomly sampled subset of the Cellbench dataset consisting of 3 cell types 895 cells and 894 genes.
library(Cepo)
library(SingleCellExperiment)
data("cellbench", package = "Cepo")
cellbench
## class: SingleCellExperiment
## dim: 894 895
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(894): AP000902.1 TNNI3 ... SCMH1 IGF2BP2
## rowData names(0):
## colnames(895): CELL_000001 CELL_000003 ... CELL_000955 CELL_000965
## colData names(17): unaligned aligned_unmapped ... sizeFactor celltype
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
cellbench = cellbench[!duplicated(rownames(cellbench)),]
Columns of the colData
indicate the individual id and various metadata for each cell. colData
contains celltype
labels, which will be required to run Cepo. Differential stability analysis performed on the entire cell type repertoire.
colData(cellbench)[1:5,]
## DataFrame with 5 rows and 17 columns
## unaligned aligned_unmapped mapped_to_exon mapped_to_intron
## <integer> <integer> <integer> <integer>
## CELL_000001 167234 8341 526950 40991
## CELL_000003 174510 8608 513021 42270
## CELL_000004 158346 7796 504676 39684
## CELL_000005 159070 6968 486645 38252
## CELL_000006 144914 8610 465126 33435
## ambiguous_mapping mapped_to_ERCC mapped_to_MT number_of_genes
## <integer> <integer> <integer> <numeric>
## CELL_000001 21392 0 22342 11237
## CELL_000003 20170 0 20943 11203
## CELL_000004 18628 0 14021 11237
## CELL_000005 20029 0 14100 10920
## CELL_000006 21732 0 11855 11157
## total_count_per_cell non_mt_percent non_ribo_percent outliers
## <numeric> <numeric> <numeric> <factor>
## CELL_000001 266880 0.823985 0.797096 FALSE
## CELL_000003 251204 0.828956 0.801715 FALSE
## CELL_000004 250040 0.839618 0.816149 FALSE
## CELL_000005 244441 0.838746 0.798577 FALSE
## CELL_000006 235288 0.817904 0.788378 FALSE
## cell_line cell_line_demuxlet demuxlet_cls sizeFactor celltype
## <character> <character> <character> <numeric> <character>
## CELL_000001 HCC827 HCC827 SNG 2.15032 HCC827
## CELL_000003 HCC827 HCC827 SNG 2.00889 HCC827
## CELL_000004 HCC827 HCC827 SNG 2.06447 HCC827
## CELL_000005 HCC827 HCC827 SNG 1.92110 HCC827
## CELL_000006 H1975 H1975 SNG 1.92716 H1975
Note that, if cell-type labels are unknown, we would need to cluster cells into groups via some clustering algorithm. In the example dataset, we have 3 cell types, H1975, H2228 and HCC827, all of which are commonly used cell lines of lung adenocarcinomas.
unique(cellbench$celltype)
## [1] "HCC827" "H1975" "H2228"
Run Cepo to generate list of cell identity genes
Main arguments of Cepo
There are two main arguments to Cepo:
1) exprsMat
is the input data, which should be normalized data, such as counts per million (CPM) or log2-CPM (e.g., logcounts
as created via scater::logNormCounts
).
2) cellTypes
receives as input a vector of cell-type labels. Note that the cell-type labels should be equal in length and ordered the same as the column names in exprsMat
.
ds_res = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype)
The Cepo
function returns a list of two elements by default. The first element is a DataFrame
of DS statistics. In this DataFrame
, each column corresponds to the DS statistics for that celltype across all genes. A higher DS statistic value denotes a gene that is more prioritized as a differentially stable gene in that given cell type. In the output DataFrame, the columns correspond to each cell type and each row correspond to a gene.
ds_res
## $stats
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0.852809 -0.450000 -0.402809
## CT45A3 0.834831 -0.408146 -0.426685
## AL049870.3 0.815309 -0.465169 -0.350140
## TDRD9 0.753652 -0.440871 -0.312781
## TNNI3 0.748876 -0.358848 -0.390028
## ... ... ... ...
## STK24 -0.655478 0.369663 0.285815
## CPVL -0.669382 0.136236 0.533146
## BBOX1-AS1 -0.674860 0.436657 0.238202
## COL4A2 -0.689747 0.397331 0.292416
## KCNK1 -0.700702 0.329916 0.370787
##
## $pvalues
## NULL
##
## attr(,"class")
## [1] "Cepo" "list"
Filtering
In many cases, it is beneficial to perform filtering of lowly expressed genes prior to differential analysis. The parameter exprsPct
specifies the threshold for filtering of lowly expressed genes should be performed. By default, this is set of NULL
. A value between 0 and 1 should be provided. Whilst there is no set rule to the threshold, we recommend a value between 0.05
and 0.07
, which will keep any genes that are expressed in 5-7% in at least one cell type, for microfluidic-based data.
ds_res_zprop = Cepo::Cepo(exprsMat = logcounts(cellbench),
cellTypes = cellbench$celltype,
exprsPct = 0.5)
The parameter logfc
specifies minimum log fold-change in gene expression. A value of 0.2
will keep any genes that show at least abs(0.2)
log fold change in gene expression in at least one cell type. By default, this value is NULL
.
ds_res_logfc = Cepo(exprsMat = logcounts(cellbench),
cellTypes = cellbench$celltype,
logfc = 1)
Cepo
outputs some useful stats, including the number of genes nrow
and gene names rownames
. By checking nrow
, we can see that as expected with filtering the number of genes included in the Cepo
run becomes fewer.
nrow(ds_res$stats)
## [1] 889
nrow(ds_res_zprop$stats)
## [1] 841
nrow(ds_res_logfc$stats)
## [1] 853
Computing p-values
There are two methods to compute p-values in Cepo
. The fast approach uses normal approximation of the Cepo statistics to estimate the null distribution. As this only required 100-200 sample runs of Cepo, it is much quicker, and the default approach, than the second permutation approach.
The output of running the p-value computation is a DataFrame
of p-values associated with the DS statistics. In this DataFrame
, each column corresponds to the p-values associated with the DS statistics.
ds_res_pvalues = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype,
computePvalue = 200,
prefilter_pzero = 0.4)
## Prefiltering 235 genes....
We can visualise the correlation between the Cepo statistics and -log10 p-values.
idx = rownames(ds_res_pvalues$stats)
par(mfrow=c(1,3))
for (i in unique(cellbench$celltype)) {
plot(rank(ds_res_pvalues$stats[[i]]),
rank(-log10(ds_res_pvalues$pvalues[idx, i])),
main = i,
xlab = "rank Cepo statistics",
ylab = "rank -log10 p-values")
}
par(mfrow=c(1,1))
The permutation approach requires the users to set the computePvalue
argument to a number of bootstrap runs required (we recommend this to be at least 10000). Each column of the DataFrame
corresponds to the p-values associated with the DS statistics obtained through bootstrap on the cells.
ds_res_pvalues = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype,
# we use a low value for demonstration purposes
computePvalue = 100,
computeFastPvalue = FALSE)
ds_res_pvalues
## $stats
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0.852809 -0.450000 -0.402809
## CT45A3 0.834831 -0.408146 -0.426685
## AL049870.3 0.815309 -0.465169 -0.350140
## TDRD9 0.753652 -0.440871 -0.312781
## TNNI3 0.748876 -0.358848 -0.390028
## ... ... ... ...
## STK24 -0.655478 0.369663 0.285815
## CPVL -0.669382 0.136236 0.533146
## BBOX1-AS1 -0.674860 0.436657 0.238202
## COL4A2 -0.689747 0.397331 0.292416
## KCNK1 -0.700702 0.329916 0.370787
##
## $pvalues
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0 1 1
## CT45A3 0 1 1
## AL049870.3 0 1 1
## TDRD9 0 1 1
## TNNI3 0 1 1
## ... ... ... ...
## STK24 1 0.00 0
## CPVL 1 0.02 0
## BBOX1-AS1 1 0.00 0
## COL4A2 1 0.00 0
## KCNK1 1 0.00 0
##
## attr(,"class")
## [1] "Cepo" "list"
Visualizing results
We can visualize the overlap of differential stability genes between cell types.
library(UpSetR)
res_name = topGenes(object = ds_res, n = 500)
upset(fromList(res_name), nsets = 3)
Density plot of two genes from each cell type.
plotDensities(x = cellbench,
cepoOutput = ds_res,
nGenes = 2,
assay = "logcounts",
celltypeColumn = "celltype")
## AC092447.7, CT45A3, HLA-DRB6, AR, CASC9, AC011632.1 will be plotted
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## ℹ The deprecated feature was likely used in the Cepo package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can also specify the genes to be plotted.
plotDensities(x = cellbench,
cepoOutput = ds_res,
genes = c("PLTP", "CPT1C", "MEG3", "SYCE1", "MICOS10P3", "HOXB7"),
assay = "logcounts",
celltypeColumn = "celltype")