# 1 Introduction

This vignette provides an introductory example on how to work with the analysis framework firstly proposed in (Calgaro et al., 2020).

The package is named benchdamic, acronym for “BENCHmarking of Differential Abundance detection methods for MICrobial data”. Not only does the package structure allow the users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set benchmarks including custom methods on their datasets. Performances of each method are evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphic outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

## 1.1 Installation

To install this package, start R (version “4.2”) and enter:

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("benchdamic")

or use:

if (!require("devtools", quietly = TRUE))
install.packages("devtools")

devtools::install_github("mcalgaro93/benchdamic")

Then, load some packages for basic functions and data:

library(benchdamic)
# Generate simulated data
library(SPsimSeq)
# Data management
library(phyloseq)
library(SummarizedExperiment)
library(plyr)
# Graphics and tables
library(ggplot2)
library(cowplot)
library(kableExtra)

All datasets used in benchdamic are downloaded using the HMP16SData Bioconductor package (Schiffer et al., 2019).

For GOF and TIEC analyses a homogeneous group of samples (e.g. only samples from a specific experimental condition, phenotype, treatment, body site, etc.) ps_stool_16S is used (use help("ps_stool_16S") for details). It contains 16S data from:

• 32 stool samples from participants of the Human Microbiome Project;

• 71 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 71 taxa corresponding to 71 genera).

data("ps_stool_16S")
ps_stool_16S
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 71 taxa and 32 samples ]
## sample_data() Sample Data:       [ 32 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 71 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 71 tips and 70 internal nodes ]

For Concordance and Enrichment analyses the ps_plaque_16S dataset is used (use help("ps_plaque_16S") for details). It contains 16S data from:

• 30 participants of the Human Microbiome Project;

• samples collected from subgingival plaque and supragingival plaque for each subject (a total of 60 samples);

• 88 taxa, all features having the same genus-level taxonomic classification are collapsed together (a total of 88 taxa corresponding to 88 genera).

data("ps_plaque_16S")
ps_plaque_16S
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 88 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 88 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 88 tips and 87 internal nodes ]

# 2 Goodness of Fit

Assumption: Many DA detection methods are based on parametric distributions.

Research question: Which are the parametric distributions that can fit both the proportion of zeros and the counts in your data?

## 2.1 GOF structure

As different methods rely on different statistical distributions to perform DA analysis, the goodness of fit (GOF) of the statistical models underlying some of the DA methods on a 16S dataset is assessed. For each model, its ability to correctly estimate the average counts and the proportion of zeroes by taxon is evaluated.

Five distributions are considered: (1) the negative binomial (NB) used in edgeR and DeSeq2 (Robinson et al., 2010; Love et al., 2014), (2) the zero-inflated negative binomial (ZINB) used in ZINB-WaVE (Risso et al., 2018), (3) the truncated Gaussian Hurdle model of MAST (Finak et al., 2015), (4) the zero-inflated Gaussian (ZIG) mixture model of metagenomeSeq (Paulson et al., 2013), and (5) the Dirichlet-Multinomial (DM) distribution underlying ALDEx2 Monte-Carlo sampling (Fernandes et al., 2014) and multivariate extension of the beta-binomial distribution used by conrcob (Martin et al., 2020).

The relationships between the functions used in this section are explained by the diagram in Figure 1. To help with the reading: green boxes represent the inputs or the outputs, red boxes are the methods and blue boxes are the main parameters of those method.

## 2.2 Parametric distributions

### 2.2.1 Negative Binomial and Zero-Inflated Negative Binomial Models

For any $$\mu \ge 0$$ and $$\theta > 0$$, let $$f_{NB}(\cdot;\mu,\theta)$$ denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean $$\mu$$ and inverse dispersion parameter $$\theta$$, namely:$f_{NB} = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\theta+1} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \forall y \in \mathbb{N}$Note that another parametrization of the NB PMF is in terms of the dispersion parameter $$\psi = \theta^{-1}$$ (although $$\theta$$ is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is $$\mu$$ and its variance is:$\sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2$In particular, the NB distribution boils down to a Poisson distribution when $$\psi=0 \iff \theta=+ \infty$$.

For any $$\pi\in[0,1]$$, let $$f_{ZINB}(\cdot;\mu,\theta,\pi)$$ be the PMF of the ZINB distribution given by:

$f_{ZINB}(\cdot;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \forall y\in\mathbb{N}$

where $$\delta_0(\cdot)$$ is the Dirac function. Here, $$\pi$$ can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

To fit these distributions on real count data the edgeR (Robinson et al., 2010) and zinbwave (Risso et al., 2018) packages are used. In benchdamic they are implemented in the fitNB() and fitZINB() functions.

### 2.2.2 Zero-Inflated Gaussian Model

The raw count for sample j and feature i is denoted by $$c_{ij}$$. The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data: $$y_{ij} = log_2(c_{ij}+1)$$ as a mixture of a point mass at zero $$I_{0}(y)$$ and a count distribution $$f_{count}(y;\mu,\sigma^2) \sim N(\mu,\sigma^2)$$. Given mixture parameters $$\pi_j$$, we have that the density of the ZIG distribution for feature i, in sample j with $$s_j$$ total counts is: $f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu,\sigma^2)$

The mean model is specified as:$E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_ilog_2\left( \frac{s_j^{\hat{l}}}{N}+1 \right) \right)$

In this case, parameter $$b_{i0}$$ is the intercept of the model while the term including the logged normalization factor $$log_2\left(\frac{s_j^{\hat{l}}}{N}+1 \right)$$ captures feature-specific normalization factors through parameter $$\eta_i$$. In details, $$s_j^{\hat{l}}$$ is the median scaling factor resulted from the Cumulative Sum Scaling (CSS) normalization procedure. $$N$$ is a constant fixed by default at 1000 but it should be a number close to the scaling factors to be used as a reference, for this reason a good choice could be the median of the scaling factors (which is used instead of 1000). The mixture parameters $$\pi_j(s_j)$$ are modeled as a binomial process:

$log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot log(s_j)$

To fit this distribution on real count data the metagenomeSeq package (Paulson et al., 2013) is used. In benchdamic it is implemented in the fitZIG() function.

### 2.2.3 Truncated Gaussian Hurdle Model

The original field of application of this method was the single-cell RNAseq data, where $$y = log_2(TPM+1)$$ expression matrix was modeled as a two-part generalized regression model (Finak et al., 2015). In microbiome data that starting point translates to a $$y_{ij} = log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)$$ or a $$log_2\left(counts_{ij}\cdot\frac{ median(libSize)}{libSize_{j}}+1\right)$$.

The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance $$y_{ij}$$, the rate of presence and the level of abundance for the samples were the taxon is present, are modeled conditionally independent for each gene $$i$$. Define the indicator $$z_{ij}$$, indicating whether taxon $$i$$ is expressed in sample $$j$$ (i.e., $$z_{ij} = 0$$ if $$y_{ij} = 0$$ and $$z_{ij} = 1$$ if $$y_{ij} > 0$$). We fit logistic regression models for the discrete variable $$Z$$ and a Gaussian linear model for the continuous variable $$(Y|Z=1)$$ independently, as follows:

$logit(Pr(Z_{ij}=1))=X_j\beta_i^D$

$P(Y_{ij}=y|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)$

To estimate this distribution on real count data the MAST package (Finak et al., 2015) is used. In benchdamic it is implemented in the fitHURDLE() function.

### 2.2.4 Dirichlet-Multinomial Mixture Model

The probability mass function of a $$n$$ dimensional multinomial sample $$y = (y_1,...,y_n)^T$$ with library size $$libSize = \sum_{i=1}^ny_i$$ and parameter $$p=(p_1,...,p_n)$$ is:

$f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i}$

The mean-variance structure of the MN model doesn’t allow over-dispersion, which is common in real data. DM distribution models the probability parameter $$p$$ in the MN model by a Dirichlet distribution. The probability mass of a n-category count vector $$y$$ over $$libSize$$ trials under DM with parameter $$\alpha=(\alpha_1,...,\alpha_n)$$, $$a_i>0$$ and proportion vector $$p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}$$ is:

$f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(a_i)y_i}{(\sum_i\alpha_i)\cdot libSize}$

The mean value for the $$i^{th}$$ taxon and $$j^{th}$$ sample of the count matrix is given by $$libSize_j\cdot \frac{\alpha_{ij}}{\sum_i a_{ij}}$$.

To estimate this distribution on real count data the MGLM package (Zhang et al., 2017; Zhang and Zhou, 2022) is used. In benchdamic it is implemented in the fitDM() function.

## 2.3 Comparing estimated and observed values

The goodness of fit for the previously described distributions is assessed comparing estimated and observed values. For each taxon the following measures are compared:

• the Mean Difference (MD) i.e. the difference between the estimated mean and the observed mean abundance (log scale);

• the Zero Probability Difference (ZPD) i.e. the difference between the probability to observe a zero and the observed proportion of samples which have zero counts.

To easily compare estimated and observed mean values the natural logarithm transformation, with the continuity correction ($$log(counts+1)$$), is well suited, indeed it reduces the count range making the differences more stable.

Except for the fitHURDLE() function, which performs a CPM transformation on the counts (or the one with the median library size), and the fitZIG() function which models the $$log_2(counts+1)$$, the other methods, fitNB(), fitZINB(), and fitDM(), model the $$counts$$ directly. For these reasons, fitHURDLE()’s output should not be compared directly to the observed $$log(counts+1)$$ mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

Here an example on how to fit a Truncated Gaussian hurdle model:

example_HURDLE <- fitHURDLE(
object = ps_stool_16S,
scale = "median"
)
head(example_HURDLE)
##                     Y          Y0
## OTU_97.192   2.152592 0.312760011
## OTU_97.1666        NA 0.967681879
## OTU_97.31213 4.057621 0.094509772
## OTU_97.39094 1.123198 0.749635900
## OTU_97.40451 7.327534 0.001949454
## OTU_97.19587 4.716914 0.063341989

The values above are those estimated by the fitHURDLE() function. Some NA values could be present due to taxa sparsity. The internally used function to prepare for the comparisons the observed counts is prepareObserved(), specifying the scale parameter if the HURDLE model is considered (if scale = “median”, the median library size is used to scale counts instead of $$10^6$$):

observed_hurdle <- prepareObserved(
object = ps_stool_16S,
scale = "median")
head(observed_hurdle)
##                       Y      Y0
## OTU_97.192   3.21387169 0.31250
## OTU_97.1666  0.03923904 0.96875
## OTU_97.31213 4.88350981 0.09375
## OTU_97.39094 4.89778959 0.75000
## OTU_97.40451 7.51646781 0.00000
## OTU_97.19587 5.26661203 0.06250

Which are different from the non-scaled observed values:

head(prepareObserved(object = ps_stool_16S))
##                       Y      Y0
## OTU_97.192   3.34109346 0.31250
## OTU_97.1666  0.03077166 0.96875
## OTU_97.31213 4.77675725 0.09375
## OTU_97.39094 4.44081133 0.75000
## OTU_97.40451 7.53824550 0.00000
## OTU_97.19587 5.36801832 0.06250

The function to compute MD and ZPD values, is meanDifferences():

head(meanDifferences(
estimated = example_HURDLE,
observed = observed_hurdle
))
##           MD           ZPD
## 1 -1.0612798  0.0002600107
## 2         NA -0.0010681206
## 3 -0.8258883  0.0007597716
## 4 -3.7745917 -0.0003641001
## 5 -0.1889343  0.0019494545
## 6 -0.5496976  0.0008419888

A wrapper function to simultaneously perform the estimates and the mean differences is fitModels():

GOF_stool_16S <- fitModels(
object = ps_stool_16S,
models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
scale_HURDLE = c("median", "default"),
verbose = FALSE # TRUE is always suggested
)

Exploiting the internal structure of the fitModels()’s output the Root Mean Squared Error (RMSE) values for MD values can be extracted (the lower, the better):

plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)
##            Model       RMSE
## 1             NB 0.07057518
## 2           ZINB 0.14298816
## 3             DM 0.97766699
## 4            ZIG 0.53150840
## 5  HURDLE_median 0.85437589
## 6 HURDLE_default 3.11439757

Similarly, they are extracted for ZPD values:

plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)
##            Model         RMSE
## 1             NB 0.0741340631
## 2           ZINB 0.0219382365
## 3             DM 0.0436436801
## 4            ZIG 0.1129285606
## 5  HURDLE_median 0.0008776059
## 6 HURDLE_default 0.0008832596

## 2.4 Visualization

### 2.4.1 Mean Differences

To plot estimated and observed values the plotMD() function can be used (Figure 2). No systematic trend are expected, moreover, the closer the values to the dotted line are (representing equality between observed and estimated values), the better the goodness of fit relative to the model.

plotMD(
data = GOF_stool_16S,
difference = "MD",
split = TRUE
)

If some warning messages are shown with this graph, they are likely due to sparse taxa. To address this, the number of NA values generated by each model can be investigated (which are 24 for each HURDLE model):

plyr::ldply(GOF_stool_16S, function(model)
c("Number of NAs" = sum(is.na(model))),
.id = "Distribution")

To summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is also displayed for each model. For the HURDLE_default model, a quite different range of values of mean differences is displayed because of the excessive default scaling proposed (1 million). It is also possible to plot only a subset of the estimated models (Figure 3).

plotMD(
data = GOF_stool_16S[1:5],
difference = "MD",
split = TRUE
)