Contents

1 Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("roastgsa")

2 Paper associated to R package

library( roastgsa )

The R package roastgsa contains several functions to perform gene set analysis, for both competitive and self-contained hypothesis testing.

It follows the work by Gordon Smyth’s group on rotation based methods for gene set analysis [1], code available in R through functions roast and romer from the limma package [2].

They propose to reconsider the sample randomization approach based on permutations of GSEA [3] to the more general procedure scheme using rotations of the residual space of the data, which can be used even when the number of degrees of freedom left is very small.

In our understanding, the test statistics provided in romer, which are all functions of the moderated t-statistics ranks, are too limited. In this R package we propose to complete the romer functionality by providing other statistics used in the GSA context [4]. We consider the KS based test statistics in the GSEA and GSVA [5] as well as computationally more efficient procedures such as restandardized statistics based on summary statistics measures. The methodology is presented and compared using both simulated and benchmarking data in the following publication:

roastgsa: a comparison of rotation-based scores for gene set enrichment analysis (2023).

We encourage anybody wanting to use the R package to first read the associated paper.

In this R package, we have also provided several tools to summarize and visualize the roastgsa results. This vignette will show a work-flow for gene set analysis with roastgsa using microarray data (available in the GSEABenchmarkeR R package [6]).

3 Data: array experiment from GSEABenchmarkeR

We consider the fourth dataset available in the GSEABenchmarkeR R package, which consists of a microarray study with 30 samples, 15 paired samples corresponding to two different groups (that take values 0 and 1 respectively).

library(GSEABenchmarkeR)
geo2keggR <- loadEData("geo2kegg", nr.datasets=4)
geo2kegg <- maPreproc(list(geo2keggR[[4]]))
y <- assays(geo2kegg[[1]])$expr
covar <- colData(geo2kegg[[1]])
covar$GROUP <- as.factor(covar$GROUP)
covar$BLOCK <- as.factor(covar$BLOCK)

The objective is to understand which KEGG processes might be affected by such two genotyping conditions. These are made available through the EnrichmentBrowser R package. Here we will consider gene sets of size between 11 and 499.

#library(EnrichmentBrowser)
#kegg.hs <- getGenesets(org="hsa", db="kegg")
data(kegg.hs)
kegg.gs <- kegg.hs[which(sapply(kegg.hs,length)>10&sapply(kegg.hs,length)<500)]

4 Gene set enrichment analysis using roastgsa

4.1 Data treatment

There are some important elements that should be taken into consideration before undertaking the roastgsa. First, the expression data should be approximately normally distributed, at least in their univariate form. For RNA-seq data or any other form of counts data, prior normalization, e.g., rlog or vst (DESeq2), zscoreDGE (edgeR) or voom (limma), should be used. Besides, filtering for expression intensity and variability could be considered, especially when the gene variation across individuals is limited by the experimental coverage.

For microarray intensities, as it happens in the example of this vignette, we can perform a quantile normalization to balance out the libraries.

library(preprocessCore)
ynorm <- normalize.quantiles(y)
rownames(ynorm) <- rownames(y)
colnames(ynorm) <- colnames(y)
par(mfrow=c(1,2))
boxplot(y, las = 2)
boxplot(ynorm, las = 2)

y <- ynorm

4.2 Model setting

Indicating the model to be fitted is essential for the roastgsa implementation. This follows a similar strategy to the limma specifications. With the attributes form, covar and contrast it can be set the linear model and the specific estimated coefficient to be used in roastgsa for hypothesis testing. The matrix design is found automatically taking the form and covar parameters (see below). The contrast can be in a vector object (length equal to the number of columns in the matrix design), an integer with the column of the term of interest in the matrix design, as well as the name of such column.

form = as.formula("~ BLOCK + GROUP")
design <- model.matrix(form, covar)
contrast = "GROUP1"

Three parameters that define the roast hypothesis testing have to be properly specified: -a- the number of rotations to draw the null hypothesis (nrot); -b- the statistic to be used (set.statistic); -c- the type of hypothesis (self.contained argument), competitive (set to FALSE) or self-contained (set to TRUE).

Regarding the test statistics, the maxmean (directional) and the absmean (nondirectional) are recommended to maximise the power, with mean.rank being a good non-parametric alternative for a more robust statistic if it is known that a few highly differentially expressed genes might influence heavily the statistic calculation. For a more “democratic” statistic, one that counterbalance both ends of the distribution (negative and positive), we encourage using the mean statistic.

Below, there is the standard roastgsa instruction (under competitive testing) for maxmean and mean.rank statistics.

fit.maxmean.comp <- roastgsa(y, form = form, covar = covar,
contrast = contrast, index = kegg.gs, nrot = 500,
mccores = 1, set.statistic = "maxmean",
self.contained = FALSE, executation.info = FALSE)
f1 <- fit.maxmean.comp$res
rownames(f1) <- substr(rownames(f1),1,8)
head(f1)
##          total_genes measured_genes       est      nes        pval   adj.pval
## hsa05230          70             69 0.5425709 3.879147 0.001996008 0.05112851
## hsa04115          73             73 0.8053681 3.454336 0.001996008 0.05112851
## hsa04215          32             31 0.6703507 3.378919 0.001996008 0.05112851
## hsa04520          93             92 0.6180546 3.301359 0.001996008 0.05112851
## hsa04530         169            164 0.4898137 3.255032 0.001996008 0.05112851
## hsa05412          77             77 0.4803524 3.141053 0.001996008 0.05112851
fit.meanrank <- roastgsa(y, form = form, covar = covar,
    contrast = 2, index = kegg.gs, nrot = 500,
    mccores = 1, set.statistic = "mean.rank",
    self.contained = FALSE, executation.info = FALSE)
f2 <- fit.meanrank$res
rownames(f2) <- substr(rownames(f2),1,8)
head(f2)
##          total_genes measured_genes       est  pval.diff adj.pval.diff
## hsa04710          34             33  2526.621 0.00998004      0.996008
## hsa04012          85             84  2213.738 0.02195609      0.996008
## hsa00220          22             21  4323.690 0.04191617      0.996008
## hsa00230         128            124 -1623.702 0.04191617      0.996008
## hsa04725         113            111 -1226.869 0.06187625      0.996008
## hsa04964          23             23  4231.543 0.06187625      0.996008
##          pval.mixed adj.pval.mixed
## hsa04710 0.46706587      0.7004377
## hsa04012 0.04990020      0.5574657
## hsa00220 0.05988024      0.5574657
## hsa00230 0.31536926      0.6214081
## hsa04725 0.48103792      0.7054838
## hsa04964 0.02794411      0.5574657

4.3 Visualization of the results

Several functions to summarize or visualize the results can be applied to objects of class roastgsa, which are found as output of the roastgsa function.

The plot function of a roastgsa object produces a general image of the differential expression results within any tested gene set. If type = "stats", it shows the ordered moderated t-statistics in various formats, area for up- and down- expressed genes, barcode plot for these ordered values and density. With the argument whplot it can be selected the gene set of interest (either an integer with the ordered position in the roastgsa output or the name of the gene set).

plot(fit.maxmean.comp, type ="stats", whplot = 2,  gsainfo = TRUE,
    cex.sub = 0.5, lwd = 2)

If the roastgsa statistic is mean.rank, the moderated-t statistic centred ranks are printed instead.

plot(fit.meanrank,  type ="stats",  whplot = 1, gsainfo = TRUE,
    cex.sub = 0.4, lwd = 2)

Even though the type = "GSEA" option is directly interpretable for ksmean and ksmax statistics, we find it useful for seeing the behavior in both Kolgomorov-Smirnov type scores and simple summary statistics:

plot(fit.maxmean.comp, type = "GSEA", whplot = 2, gsainfo = TRUE,
    maintitle = "",  cex.sub = 0.5, statistic = "mean")

plot(fit.meanrank, type = "GSEA",  whplot = 1, gsainfo = TRUE,
    maintitle = "",  cex.sub = 0.5, statistic = "mean")

Another graphic that is proposed in this package to visualize the genes activity within a gene set can be obtained through the function heatmaprgsa_hm. The main intention is to illustrate the variation across samples for the gene set of interest. We highly recommend the generation of this graphic, or any other similar plot that shows sample variation for the tested gene sets. Not only for showing which genes are activated but also as quality control to detect samples that can be highly influential in the GSA analysis.

hm <- heatmaprgsa_hm(fit.maxmean.comp, y, intvar = "GROUP", whplot = 3,
        toplot = TRUE, pathwaylevel = FALSE, mycol = c("orange","green",
        "white"), sample2zero = FALSE)