Contents

require(IgGeneUsage)
require(knitr)
require(ggplot2)
require(ggforce)
require(gridExtra)
require(ggrepel)
require(rstan)
require(reshape2)
rstan_options(auto_write = TRUE)

1 Introduction

Decoding the properties of immune repertoires is key in understanding the response of adaptive immunity to challenges such as viral infection. One important property is biases in immunoglobulin (Ig) gene usage between biological conditions (e.g. healthy vs tumor). Yet, most analyses for differential gene usage (DGU) are performed qualitatively, or with inadequate statistical methods. Here we introduce IgGeneUsage, a computational tool for DGU analysis.

2 Input

The main input of IgGeneUsage is a data.frame that has the following 4 columns:

  1. sample_id: name of the repertoire (e.g. Patient-1)
  2. condition: name of the condition to which each repertoire belongs (e.g. healthy or tumor)
  3. gene_name: gene name (e.g. IGHV1-10 or family TRVB1)
  4. gene_usage_count: numeric (count) of usage related in sample x gene x condition specified in columns 1-3

The sum of all gene usage counts (column 4) for a given repertoire is equal to the repertoire size (number of cells in the repertoire).

3 Model

IgGeneUsage transforms the provided input in the following way. Given \(R\) repertoires, each having \(G\) genes, IgGeneUsage generates a gene usage matrix \(Y^{R \times G}\). Row sums in \(Y\) define the total usage in each repertoire (\(N\)). The design variable \(X\) is set to \(X = 1\) for repertoires that belong to the first condition, and \(X = -1\) otherwise.

For the analysis of DGU between two biological conditions, we designed the following Bayesian model (\(M\)) for zero-inflated beta-binomial regression. This model can fit over-dispersed gene usage data. The immune repertoire data is also not exhaustive, which leads to misdetection of genes that are systematically rearranged at low probability. The zero-inflated component of our model accounts for this:

\[\begin{align} p(Y_{ij} \mid M) = \begin{cases} \kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ = 0} \\ (1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ > 0} \end{cases}\\ \theta_{ij}=\operatorname{logit^{-1}}\left(\alpha_{j}+\beta_{ij}X_{i}\right)\\ \beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\ \gamma_{j}\sim\operatorname{Normal}\left(\hat{\gamma},\hat{\gamma}_{\sigma} \right) \\ \alpha_{j}\sim\operatorname{Normal}\left(\hat{\alpha},\hat{\alpha}_{\sigma} \right) \\ \hat{\gamma} \sim \operatorname{Normal}\left(0, 5\right) \\ \hat{\alpha} \sim \operatorname{Normal}\left(0, 10\right) \\ \gamma_{\sigma}, \hat{\gamma}_{\sigma}, \hat{\alpha}_{\sigma} \sim \operatorname{Cauchy^{+}}\left(0, 1\right) \\ \phi \sim \operatorname{Exponential}\left(\tau\right) \\ \tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\ \kappa \sim \operatorname{Beta}\left(1, 3\right) \end{align}\]

Model \(M\) legend:

In the output of IgGeneUsage, we report the mean effect size (\(\gamma\)) and its 95% highest density interval (HDI). Genes with \(\gamma \neq 0\) (e.g. if 95% HDI of \(\gamma\) excludes 0) are most likely to experience differential usage. Additionally, we report the probability of differential gene usage (\(\pi\)): \[\begin{align} \pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, \int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1 \end{align}\] with \(\pi = 1\) for genes with strong differential usage, and \(\pi = 0\) for genes with negligible differential gene usage. Both metrics are computed based on the posterior distribution of \(\gamma\), and are thus related. We find \(\pi\) slightly easier to interpret.

4 Updated model for IgGeneUsage version > 1.7.25

\[\begin{align} p(Y_{ij} \mid M) = \begin{cases} \kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ = 0} \\ (1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ > 0} \end{cases}\\ \theta_{ij}=\operatorname{logit^{-1}}\left(\alpha_{ij}+\beta_{ij}X_{i}\right)\\ \alpha_{ij}\sim\operatorname{Normal}\left(\delta_{j},\delta_{\sigma} \right)\\ \beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\ \gamma_{j}\sim\operatorname{Normal}\left(0.0,\hat{\gamma}_{\sigma} \right) \\ \delta_{j}\sim\operatorname{Normal}\left(0.0,\hat{\delta}_{\sigma} \right) \\ \gamma_{\sigma}, \hat{\gamma}_{\sigma}, \delta_{\sigma}, \hat{\delta}_{\sigma} \sim \operatorname{Cauchy^{+}}\left(0, 1\right) \\ \phi \sim \operatorname{Exponential}\left(\tau\right) \\ \tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\ \kappa \sim \operatorname{Beta}\left(1, 3\right) \end{align}\]

5 Case Study A

IgGeneUsage has a couple of built-in Ig gene usage datasets. Some were obtained from studies and others were simulated.

Lets look into the simulated dataset d_zibb. This dataset was generated by a zero-inflated beta-binomial (ZIBB) model, and IgGeneUsage was designed to fit ZIBB-distributed data.

data("d_zibb", package = "IgGeneUsage")
knitr::kable(head(d_zibb))
sample_id condition gene_name gene_usage_count
S1 C1 G1 25
S1 C1 G2 123
S1 C1 G3 29
S1 C1 G4 442
S1 C1 G5 60
S1 C1 G6 106

We can also visualize d_zibb with ggplot:

ggplot(data = d_zibb)+
  geom_point(aes(x = gene_name, y = gene_usage_count, col = condition),
             position = position_dodge(width = .7), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Gene usage")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))

5.1 DGU analysis

As main input IgGeneUsage uses a data.frame formatted as d_zibb. Other input parameters allow you to configure specific settings of the rstan sampler.

In this example we analyze d_zibb with 3 MCMC chains, 1500 iterations each including 500 warm-ups using a single CPU core (Hint: for parallel chain execution set parameter mcmc.cores = 3). We report for each model parameter its mean and 95% highest density interval (HDIs).

Important remark: you should run DGU analyses using default IgGeneUsage parameters. If warnings or errors are reported with regard to the MCMC sampling, please consult the Stan manual1 https://mc-stan.org/misc/warnings.html and adjust the inputs accordingly. If the warnings persist, please submit an issue with a reproducible script at the Bioconductor support site or on Github2 https://github.com/snaketron/IgGeneUsage/issues.

M <- DGU(usage.data = d_zibb, # input data
         mcmc.warmup = 500, # how many MCMC warm-ups per chain (default: 500)
         mcmc.steps = 1500, # how many MCMC steps per chain (default: 1,500)
         mcmc.chains = 3, # how many MCMC chain to run (default: 4)
         mcmc.cores = 1, # how many PC cores to use? (e.g. parallel chains)
         hdi.level = 0.95, # highest density interval level (de fault: 0.95)
         adapt.delta = 0.8, # MCMC target acceptance rate (default: 0.95)
         max.treedepth = 10) # tree depth evaluated at each step (default: 12)
FALSE 
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 1).
FALSE Chain 1: 
FALSE Chain 1: Gradient evaluation took 0.000224 seconds
FALSE Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.24 seconds.
FALSE Chain 1: Adjust your expectations accordingly!
FALSE Chain 1: 
FALSE Chain 1: 
FALSE Chain 1: Iteration:    1 / 1500 [  0%]  (Warmup)
FALSE Chain 1: Iteration:  250 / 1500 [ 16%]  (Warmup)
FALSE Chain 1: Iteration:  500 / 1500 [ 33%]  (Warmup)
FALSE Chain 1: Iteration:  501 / 1500 [ 33%]  (Sampling)
FALSE Chain 1: Iteration:  750 / 1500 [ 50%]  (Sampling)
FALSE Chain 1: Iteration: 1000 / 1500 [ 66%]  (Sampling)
FALSE Chain 1: Iteration: 1250 / 1500 [ 83%]  (Sampling)
FALSE Chain 1: Iteration: 1500 / 1500 [100%]  (Sampling)
FALSE Chain 1: 
FALSE Chain 1:  Elapsed Time: 10.5546 seconds (Warm-up)
FALSE Chain 1:                12.5872 seconds (Sampling)
FALSE Chain 1:                23.1418 seconds (Total)
FALSE Chain 1: 
FALSE 
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 2).
FALSE Chain 2: 
FALSE Chain 2: Gradient evaluation took 0.000213 seconds
FALSE Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 2.13 seconds.
FALSE Chain 2: Adjust your expectations accordingly!
FALSE Chain 2: 
FALSE Chain 2: 
FALSE Chain 2: Iteration:    1 / 1500 [  0%]  (Warmup)
FALSE Chain 2: Iteration:  250 / 1500 [ 16%]  (Warmup)
FALSE Chain 2: Iteration:  500 / 1500 [ 33%]  (Warmup)
FALSE Chain 2: Iteration:  501 / 1500 [ 33%]  (Sampling)
FALSE Chain 2: Iteration:  750 / 1500 [ 50%]  (Sampling)
FALSE Chain 2: Iteration: 1000 / 1500 [ 66%]  (Sampling)
FALSE Chain 2: Iteration: 1250 / 1500 [ 83%]  (Sampling)
FALSE Chain 2: Iteration: 1500 / 1500 [100%]  (Sampling)
FALSE Chain 2: 
FALSE Chain 2:  Elapsed Time: 10.2598 seconds (Warm-up)
FALSE Chain 2:                15.9602 seconds (Sampling)
FALSE Chain 2:                26.2201 seconds (Total)
FALSE Chain 2: 
FALSE 
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 3).
FALSE Chain 3: 
FALSE Chain 3: Gradient evaluation took 0.000212 seconds
FALSE Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 2.12 seconds.
FALSE Chain 3: Adjust your expectations accordingly!
FALSE Chain 3: 
FALSE Chain 3: 
FALSE Chain 3: Iteration:    1 / 1500 [  0%]  (Warmup)
FALSE Chain 3: Iteration:  250 / 1500 [ 16%]  (Warmup)
FALSE Chain 3: Iteration:  500 / 1500 [ 33%]  (Warmup)
FALSE Chain 3: Iteration:  501 / 1500 [ 33%]  (Sampling)
FALSE Chain 3: Iteration:  750 / 1500 [ 50%]  (Sampling)
FALSE Chain 3: Iteration: 1000 / 1500 [ 66%]  (Sampling)
FALSE Chain 3: Iteration: 1250 / 1500 [ 83%]  (Sampling)
FALSE Chain 3: Iteration: 1500 / 1500 [100%]  (Sampling)
FALSE Chain 3: 
FALSE Chain 3:  Elapsed Time: 10.9127 seconds (Warm-up)
FALSE Chain 3:                13.1809 seconds (Sampling)
FALSE Chain 3:                24.0936 seconds (Total)
FALSE Chain 3:

5.2 Output format

The following objects are provided as part of the output of DGU:

  • glm.summary (main results of IgGeneUsage): quantitative DGU summary
  • test.summary: quantitative DGU summary from frequentist methods: Welch’s t-test (T-test) and Wilcoxon signed-rank test (U-test)
  • glm: rstan (‘stanfit’) object of the fitted model \(rightarrow\) used for model checks (see section ‘Model checking’)
  • ppc.data: posterior predictive checks data (see section ‘Model checking’)
summary(M)
FALSE              Length Class      Mode
FALSE glm.summary  9      data.frame list
FALSE test.summary 9      data.frame list
FALSE glm          1      stanfit    S4  
FALSE ppc.data     2      -none-     list
FALSE usage.data   9      -none-     list

5.3 Model checking

  • Check your model fit. For this, you can use the object glm.

    • Minimal checklist of successful MCMC sampling3 https://mc-stan.org/misc/warnings.html:
      • no divergences
      • no excessive warnings from rstan
      • Rhat < 1.05
      • high Neff
    • Minimal checklist for valid model:
      • posterior predictive checks (PPCs): is model consistent with reality, i.e. is there overlap between simulated and observed data?
      • leave-one-out analysis

5.3.1 MCMC sampling

  • divergences, tree-depth, energy
rstan::check_hmc_diagnostics(M$glm)
FALSE 
FALSE Divergences:
FALSE 
FALSE Tree depth:
FALSE 
FALSE Energy:
  • Rhat and Neff
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
                        rstan::stan_ess(object = M$glm),
                        nrow = 1)

5.3.2 PPCs: repertoire-specific

The model used by IgGeneUsage is generative, i.e. with the model we can generate usage of each Ig gene in a given repertoire (y-axis). Error bars show 95% HDI of mean posterior prediction. The predictions can be compared with the observed data (x-axis). For points near the diagonal \(\rightarrow\) accurate prediction.

ggplot(data = M$ppc.data$ppc.repertoire)+
  facet_wrap(facets = ~sample_name, nrow = 3)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_x_log10()+
  scale_y_log10()+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "Predicted usage [counts]")+
  annotation_logticks(base = 10, sides = "lb")

5.3.3 PPCs: overall

Prediction of generalized gene usage within a biological condition is also possible. We show the predictions (y-axis) of the model, and compare them against the observed mean usage (x-axis). If the points are near the diagonal \(\rightarrow\) accurate prediction. Errors are 95% HDIs of the mean.

ggplot(data = M$ppc.data$ppc.gene)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100), col = "darkgray")+
  geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100, 
                 col = condition), size = 2)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [%]")+
  ylab(label = "Predicted usage [%]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "lb")

5.4 Results

Each row of glm.summary summarizes the degree of DGU observed for specific Igs. Two metrics are reported:

  • es (also referred to as \gamma): effect size encoded in(parameter \(\gamma\) from model \(M\)) on DGU, where contrast gives the direction of the effect (e.g. tumor - healthy or healthy - tumor)
  • pmax (also referred to as \pi): probability of DGU (parameter \(\pi\) from model \(M\))

For es we also have the mean, median standard error (se), standard deviation (sd), L (low bound of 95% HDI), H (high bound of 95% HDI)

kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
es_mean es_mean_se es_sd es_median es_L es_H contrast pmax gene_name
0.220 0.003 0.242 0.210 -0.231 0.714 C2 - C1 0.634 G1
0.004 0.006 0.484 -0.001 -0.934 0.963 C2 - C1 0.001 G10
0.294 0.003 0.175 0.292 -0.048 0.636 C2 - C1 0.908 G11
0.099 0.003 0.199 0.095 -0.291 0.494 C2 - C1 0.371 G12
-0.200 0.005 0.437 -0.183 -1.095 0.602 C2 - C1 0.352 G13
0.121 0.003 0.203 0.124 -0.281 0.519 C2 - C1 0.477 G14

We know that the values of \gamma and \pi are related to each other. Lets visualize them for all genes (shown as a point). Names are shown for genes associated with \(\pi \geq 0.9\). Dashed horizontal line represents null-effect (\(\gamma = 0\)).

Notice that the gene with \(\pi \approx 1\) also has an effect size whose 95% HDI (error bar) does not overlap the null-effect. The genes with high degree of differential usage are easy to detect with this figure.

# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)

stats <- merge(x = stats, y = M$test.summary, by = "gene_name")

ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean), col = "darkgray")+
  geom_text_repel(data = stats[stats$pmax >= 0.9, ],
                  aes(x = pmax, y = es_mean, label = gene_fac),
                  min.segment.length = 0, size = 2.75)+
  theme_bw(base_size = 11)+
  xlab(label = expression(pi))+
  xlim(c(0, 1))