`sccomp` tests differences in cell-type proportions from single-cell data. It is robust against outliers, models continuous and discrete factors, and supports random-effect/intercept modelling.

Please cite: sccomp: Robust differential composition and variability analysis for single-cell data (PNAS).

## 0.1 Characteristics

- Complex linear models with continuous and categorical covariates
- Multilevel modelling, with population fixed and random effects/intercept
- Modelling data from counts
- Testing differences in cell-type proportions
- Testing differences in cell-type specific variability
- Cell-type information sharing for adaptive shrinkage of variability
- Testing differential variability
- Probabilistic outlier identification
- Cross-dataset learning (hyperpriors)

# 1 Installation

`sccomp` is based on `cmdstanr`, which provides the latest version of `cmdstan`, the Bayesian modelling tool. `cmdstanr` is not on CRAN, so installation is a simple three-step process (the user is prompted if a step has been forgotten).

- R installation of `sccomp`
- R installation of `cmdstanr`
- `cmdstanr` call to install `cmdstan`

**Bioconductor**

```
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
# Step 1
BiocManager::install("sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
```

**GitHub**

```
# Step 1
devtools::install_github("MangiolaLaboratory/sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
```

| Function | Description |
|---|---|
| `sccomp_estimate` | Fit the model onto the data, and estimate the coefficients |
| `sccomp_remove_outliers` | Identify outliers probabilistically based on the model fit, and exclude them from the estimation |
| `sccomp_test` | Calculate the probability that the coefficients are outside the H0 interval (i.e. test_composition_above_logit_fold_change) |
| `sccomp_replicate` | Simulate data from the model, or part of the model |
| `sccomp_predict` | Predict proportions, based on the model, or part of the model |
| `sccomp_remove_unwanted_variation` | Remove the variability for unwanted factors |
| `plot` | Produce summary plots to assess significance |

# 2 Analysis

```
library(dplyr)
library(sccomp)
library(ggplot2)
library(forcats)
library(tidyr)
data("seurat_obj")
data("sce_obj")
data("counts_obj")
```

`sccomp` can model changes in composition and variability. The formula for variability is either `~1` (the default), which assumes that the cell-group variability is independent of any covariate, or `~ factor_of_interest`, which assumes that the variability depends on the factor of interest only. The variability model must be a subset of the model for composition.
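As a minimal sketch of the second option, the call below passes `formula_variability = ~ type` alongside the composition formula. It assumes the bundled `counts_obj` example data, with `type`, `sample`, `cell_group` and `count` columns; the object name `res_variability` is only illustrative.

```r
library(sccomp)
data("counts_obj")

# Model both composition and variability as a function of `type`.
# The variability formula (~ type) is a subset of the composition formula.
res_variability =
  counts_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    formula_variability = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = count,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_test()
```

With a variability formula set, the `v_` columns of the result describe the effect of `type` on variability.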

## 2.1 Binary factor

In the output table, estimate columns with the prefix `c_` refer to `composition`, and those with the prefix `v_` refer to `variability` (when `formula_variability` is set).

### 2.1.1 From Seurat, SingleCellExperiment, metadata objects
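For a `Seurat` or `SingleCellExperiment` object, a call along the following lines is a reasonable starting point. It is a sketch that assumes the bundled `sce_obj` stores `type`, `sample` and `cell_group` in its cell-level metadata, and that the package providing the object class is installed. No `.count` argument is passed here, on the assumption that the cells per sample and cell group are tallied from the metadata.

```r
library(sccomp)
data("sce_obj")

# Assumed metadata columns: type, sample, cell_group
sccomp_result_sce =
  sce_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
  sccomp_test()
```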

### 2.1.2 From counts

```
sccomp_result =
  counts_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = count,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
  sccomp_test()
```

Here you see the results of the fit: the effects of the factor on composition and variability, together with the uncertainty around those effects.

The output is a tibble containing the following columns:

- `cell_group` - The cell groups being tested.
- `parameter` - The parameter being estimated from the design matrix described by the input `formula_composition` and `formula_variability`.
- `factor` - The covariate factor in the formula, if applicable (e.g., not present for Intercept or contrasts).
- `c_lower` - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_effect` - Mean of the posterior distribution for a composition (c) parameter.
- `c_upper` - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_pH0` - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.
- `c_FDR` - False-discovery rate of the null hypothesis for a composition (c).
- `v_lower` - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_effect` - Mean of the posterior distribution for a variability (v) parameter.
- `v_upper` - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_pH0` - Probability of the null hypothesis for a variability (v).
- `v_FDR` - False-discovery rate of the null hypothesis for a variability (v).
- `count_data` - Nested input count data.
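One way to read this table is to keep only the cell groups whose composition effect passes an FDR cutoff. The sketch below uses base R only; `significant_composition` is a hypothetical helper (not part of `sccomp`), the 0.05 cutoff is an arbitrary choice, and `toy_result` contains made-up values that merely mimic the column layout described above.

```r
# Hypothetical helper: keep rows whose composition effect passes an FDR cutoff
significant_composition = function(result, fdr_threshold = 0.05) {
  keep = !is.na(result$c_FDR) & result$c_FDR < fdr_threshold
  result[keep, c("cell_group", "parameter", "c_lower", "c_effect", "c_upper", "c_FDR")]
}

# Toy table with made-up values, only to show the expected column layout;
# in practice, pass the tibble returned by sccomp_test()
toy_result = data.frame(
  cell_group = c("B cells", "T cells", "NK cells"),
  parameter  = "typehealthy",
  c_lower    = c(0.4, -0.2, -1.5),
  c_effect   = c(0.9,  0.1, -1.0),
  c_upper    = c(1.4,  0.4, -0.5),
  c_FDR      = c(0.001, 0.6, 0.01)
)

significant_toy = significant_composition(toy_result)
significant_toy
```

The same pattern applies to the `v_` columns when a variability formula has been set.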

## 2.2 Summary plots

A plot of group proportions, faceted by groups. The blue boxplots represent the posterior predictive check. If the model is descriptively adequate for the data, the blue boxplots should roughly overlay the black boxplots, which represent the observed data. The outliers are coloured in red. A boxplot will be returned for every (discrete) covariate present in formula_composition. The colour coding represents the significant associations for composition and/or variability.

A plot of estimates of differential composition (c_) on the x-axis and differential variability (v_) on the y-axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if it exceeds the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.

We can plot the relationship between abundance and variability. As we can see below, they are positively correlated. sccomp models this relationship to obtain a shrinkage effect on the estimates of both the abundance and the variability. This shrinkage is adaptive as it is modelled jointly, thanks to Bayesian inference.

You can produce the series of plots by calling the `plot` method.
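A minimal sketch of that call follows. It refits the model so that the block is self-contained, and it assumes that `plot()` returns a named collection of plot objects; the element names are listed rather than assumed.

```r
library(sccomp)
data("counts_obj")

sccomp_result =
  counts_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = count,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_test()

# Build the summary plots and list which ones are available
plots = sccomp_result |> plot()
names(plots)
```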

## 2.3 Model proportions directly (e.g. from deconvolution)

**Note:** If counts are available, we strongly discourage the use of proportions, as an important source of uncertainty (i.e., for rare groups/cell types) is not modeled.

The use of proportions is better suited for modelling deconvolution results (e.g., of bulk RNA data), in which case counts are not available.
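The sketch below shows what such a call might look like. Because no deconvolution output ships with this document, the proportions are derived from the example counts purely for illustration. It assumes that proportions are passed through the same `.count` argument used for counts elsewhere in this document; `proportion_obj` and `res_proportion` are illustrative names.

```r
library(dplyr)
library(sccomp)
data("counts_obj")

# Derive per-sample proportions from the example counts (illustration only;
# when real counts are available, model the counts directly)
proportion_obj =
  counts_obj |>
  group_by(sample) |>
  mutate(proportion = count / sum(count)) |>
  ungroup()

res_proportion =
  proportion_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = proportion, # proportions passed in place of counts (assumption)
    cores = 1, verbose = FALSE
  )
```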

Proportions should be greater than 0. Assuming that zeros derive from a precision threshold (e.g., deconvolution), zeros are converted to the smallest non-zero value.
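The zero-handling rule can be illustrated in a few lines of base R. This is a sketch of the rule as stated, not the package's internal code: `replace_zeros` is a hypothetical helper, the input values are made up, and taking the minimum over the whole vector (rather than per sample or per cell group) is an assumption.

```r
# Hypothetical helper: replace zeros with the smallest non-zero value
replace_zeros = function(proportion) {
  smallest_non_zero = min(proportion[proportion > 0])
  ifelse(proportion == 0, smallest_non_zero, proportion)
}

# Made-up deconvolution proportions
proportion = c(0.60, 0.25, 0.00, 0.15, 0.00)
adjusted = replace_zeros(proportion)
adjusted
```

Note that after this replacement the values no longer sum exactly to 1.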

## 2.4 Continuous factor

`sccomp` can fit arbitrarily complex models. In this example we have a continuous and a binary covariate.
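A sketch of such a model is given below. It assumes that the bundled `seurat_obj` carries a numeric metadata column named `continuous_covariate` next to the binary `type` column; substitute your own column names as needed.

```r
library(sccomp)
data("seurat_obj")

# Binary factor (type) plus a continuous covariate in the composition formula
res_continuous =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + continuous_covariate,
    .sample = sample,
    .cell_group = cell_group,
    cores = 1, verbose = FALSE
  )
```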

## 2.5 Random Effect Modeling

`sccomp` supports multilevel modeling by allowing the inclusion of random effects in the compositional and variability formulas. This is particularly useful when your data has hierarchical or grouped structures, such as measurements nested within subjects, batches, or experimental units. By incorporating random effects, `sccomp` can account for variability at different levels of your data, improving model fit and inference accuracy.

### 2.5.1 Random Intercept Model

In this example, we demonstrate how to fit a random intercept model using sccomp. We’ll model the cell-type proportions with both fixed effects (e.g., treatment) and random effects (e.g., subject-specific variability).
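A random intercept can be sketched with the `(1 | grouping_variable)` formula syntax. The block below assumes that the bundled `seurat_obj` has a grouping column named `group__` in its metadata (for example a subject or batch identifier); that name is an assumption, and `res_random_intercept` is an illustrative object name.

```r
library(sccomp)
data("seurat_obj")

# Fixed effect of `type`, plus an intercept that varies by `group__`
res_random_intercept =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + (1 | group__),
    .sample = sample,
    .cell_group = cell_group,
    cores = 1, verbose = FALSE
  )
```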

Here is the input data

```
## The following objects are masked from 'package:base':
##
## intersect, t
```