## Example data

Load the example dataset, a small and randomly sampled subset of the Cellbench dataset consisting of 3 cell types 895 cells and 894 genes.

`library(Cepo)`

```
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
```

```
library(SingleCellExperiment)
data("cellbench", package = "Cepo")
cellbench
```

```
## class: SingleCellExperiment
## dim: 894 895
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(894): AP000902.1 TNNI3 ... SCMH1 IGF2BP2
## rowData names(0):
## colnames(895): CELL_000001 CELL_000003 ... CELL_000955 CELL_000965
## colData names(17): unaligned aligned_unmapped ... sizeFactor celltype
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
```

`cellbench = cellbench[!duplicated(rownames(cellbench)),]`

Columns of the `colData`

indicate the individual id and various metadata for each cell. `colData`

contains `celltype`

labels, which will be required to run *Cepo*. Differential stability analysis performed on the entire cell type repertoire.

`colData(cellbench)[1:5,]`

```
## DataFrame with 5 rows and 17 columns
## unaligned aligned_unmapped mapped_to_exon mapped_to_intron
## <integer> <integer> <integer> <integer>
## CELL_000001 167234 8341 526950 40991
## CELL_000003 174510 8608 513021 42270
## CELL_000004 158346 7796 504676 39684
## CELL_000005 159070 6968 486645 38252
## CELL_000006 144914 8610 465126 33435
## ambiguous_mapping mapped_to_ERCC mapped_to_MT number_of_genes
## <integer> <integer> <integer> <numeric>
## CELL_000001 21392 0 22342 11237
## CELL_000003 20170 0 20943 11203
## CELL_000004 18628 0 14021 11237
## CELL_000005 20029 0 14100 10920
## CELL_000006 21732 0 11855 11157
## total_count_per_cell non_mt_percent non_ribo_percent outliers
## <numeric> <numeric> <numeric> <factor>
## CELL_000001 266880 0.823985 0.797096 FALSE
## CELL_000003 251204 0.828956 0.801715 FALSE
## CELL_000004 250040 0.839618 0.816149 FALSE
## CELL_000005 244441 0.838746 0.798577 FALSE
## CELL_000006 235288 0.817904 0.788378 FALSE
## cell_line cell_line_demuxlet demuxlet_cls sizeFactor celltype
## <character> <character> <character> <numeric> <character>
## CELL_000001 HCC827 HCC827 SNG 2.15032 HCC827
## CELL_000003 HCC827 HCC827 SNG 2.00889 HCC827
## CELL_000004 HCC827 HCC827 SNG 2.06447 HCC827
## CELL_000005 HCC827 HCC827 SNG 1.92110 HCC827
## CELL_000006 H1975 H1975 SNG 1.92716 H1975
```

Note that, if cell-type labels are unknown, we would need to cluster cells into groups via some clustering algorithm. In the example dataset, we have 3 cell types, H1975, H2228 and HCC827, all of which are commonly used cell lines of lung adenocarcinomas.

`unique(cellbench$celltype)`

`## [1] "HCC827" "H1975" "H2228"`

## Run Cepo to generate list of cell identity genes

### Main arguments of Cepo

There are two main arguments to *Cepo*:
1) `exprsMat`

is the input data, which should be normalized data, such as counts per million (CPM) or log2-CPM (e.g., `logcounts`

as created via `scater::logNormCounts`

).
2) `cellTypes`

receives as input a vector of cell-type labels. Note that the cell-type labels should be equal in length and ordered the same as the column names in `exprsMat`

.

```
ds_res = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype)
```

The `Cepo`

function returns a list of two elements by default. The first element is a `DataFrame`

of DS statistics. In this `DataFrame`

, each column corresponds to the DS statistics for that celltype across all genes. A higher DS statistic value denotes a gene that is more prioritized as a differentially stable gene in that given cell type. In the output DataFrame, the columns correspond to each cell type and each row correspond to a gene.

`ds_res`

```
## $stats
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0.852809 -0.450000 -0.402809
## CT45A3 0.834972 -0.408006 -0.426966
## AL049870.3 0.815309 -0.465169 -0.350140
## TDRD9 0.753652 -0.440871 -0.312781
## TNNI3 0.748876 -0.358848 -0.390028
## ... ... ... ...
## STK24 -0.655478 0.369663 0.285815
## CPVL -0.669382 0.136236 0.533146
## BBOX1-AS1 -0.674860 0.436657 0.238202
## COL4A2 -0.689747 0.397331 0.292416
## KCNK1 -0.700702 0.329916 0.370787
##
## $pvalues
## NULL
##
## attr(,"class")
## [1] "Cepo" "list"
```

### Filtering

In many cases, it is beneficial to perform filtering of lowly expressed genes prior to differential analysis. The parameter `exprsPct`

specifies the threshold for filtering of lowly expressed genes should be performed. By default, this is set of `NULL`

. A value between 0 and 1 should be provided. Whilst there is no set rule to the threshold, we recommend a value between `0.05`

and `0.07`

, which will keep any genes that are expressed in 5-7% in at least one cell type, for microfluidic-based data.

```
ds_res_zprop = Cepo::Cepo(exprsMat = logcounts(cellbench),
cellTypes = cellbench$celltype,
exprsPct = 0.5)
```

The parameter `logfc`

specifies minimum log fold-change in gene expression. A value of `0.2`

will keep any genes that show at least `abs(0.2)`

log fold change in gene expression in at least one cell type. By default, this value is `NULL`

.

```
ds_res_logfc = Cepo(exprsMat = logcounts(cellbench),
cellTypes = cellbench$celltype,
logfc = 1)
```

`Cepo`

outputs some useful stats, including the number of genes `nrow`

and gene names `rownames`

. By checking `nrow`

, we can see that as expected with filtering the number of genes included in the `Cepo`

run becomes fewer.

`nrow(ds_res$stats)`

`## [1] 889`

`nrow(ds_res_zprop$stats)`

`## [1] 841`

`nrow(ds_res_logfc$stats)`

`## [1] 853`

### Computing p-values

There are two methods to compute p-values in `Cepo`

. The fast approach uses normal approximation of the Cepo statistics to estimate the null distribution. As this only required 100-200 sample runs of Cepo, it is much quicker, and the default approach, than the second permutation approach.

The output of running the p-value computation is a `DataFrame`

of p-values associated with the DS statistics. In this `DataFrame`

, each column corresponds to the p-values associated with the DS statistics.

```
ds_res_pvalues = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype,
computePvalue = 200,
prefilter_pzero = 0.4)
```

`## Prefiltering 233 genes....`

We can visualise the correlation between the Cepo statistics and -log10 p-values.

```
idx = rownames(ds_res_pvalues$stats)
par(mfrow=c(1,3))
for (i in unique(cellbench$celltype)) {
plot(rank(ds_res_pvalues$stats[[i]]),
rank(-log10(ds_res_pvalues$pvalues[idx, i])),
main = i,
xlab = "rank Cepo statistics",
ylab = "rank -log10 p-values")
}
```

`par(mfrow=c(1,1))`

The permutation approach requires the users to set the `computePvalue`

argument to a number of bootstrap runs required (we recommend this to be at least 10000). Each column of the `DataFrame`

corresponds to the p-values associated with the DS statistics obtained through bootstrap on the cells.

```
ds_res_pvalues = Cepo(exprsMat = logcounts(cellbench),
cellType = cellbench$celltype,
# we use a low value for demonstration purposes
computePvalue = 100,
computeFastPvalue = FALSE)
ds_res_pvalues
```

```
## $stats
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0.852809 -0.450000 -0.402809
## CT45A3 0.834972 -0.408006 -0.426966
## AL049870.3 0.815309 -0.465169 -0.350140
## TDRD9 0.753652 -0.440871 -0.312781
## TNNI3 0.748876 -0.358848 -0.390028
## ... ... ... ...
## STK24 -0.655478 0.369663 0.285815
## CPVL -0.669382 0.136236 0.533146
## BBOX1-AS1 -0.674860 0.436657 0.238202
## COL4A2 -0.689747 0.397331 0.292416
## KCNK1 -0.700702 0.329916 0.370787
##
## $pvalues
## DataFrame with 889 rows and 3 columns
## H1975 H2228 HCC827
## <numeric> <numeric> <numeric>
## AC092447.7 0 1 1
## CT45A3 0 1 1
## AL049870.3 0 1 1
## TDRD9 0 1 1
## TNNI3 0 1 1
## ... ... ... ...
## STK24 1 0 0
## CPVL 1 0 0
## BBOX1-AS1 1 0 0
## COL4A2 1 0 0
## KCNK1 1 0 0
##
## attr(,"class")
## [1] "Cepo" "list"
```

## Visualizing results

We can visualize the overlap of differential stability genes between cell types.

```
library(UpSetR)
res_name = topGenes(object = ds_res, n = 500)
upset(fromList(res_name), nsets = 3)
```

Density plot of two genes from each cell type.

```
plotDensities(x = cellbench,
cepoOutput = ds_res,
nGenes = 2,
assay = "logcounts",
celltypeColumn = "celltype")
```

`## AC092447.7, CT45A3, HLA-DRB6, AR, CASC9, AC011632.1 will be plotted`

```
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## â„¹ Please use `after_stat(density)` instead.
## â„¹ The deprecated feature was likely used in the Cepo package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

We can also specify the genes to be plotted.

```
plotDensities(x = cellbench,
cepoOutput = ds_res,
genes = c("PLTP", "CPT1C", "MEG3", "SYCE1", "MICOS10P3", "HOXB7"),
assay = "logcounts",
celltypeColumn = "celltype")
```