Contents

1 Introduction

In genetic and epigenetic research, high-throughput sequencing methods generate large amounts of data. The meaningful interpretation is one of the biggest challenges when analyzing multiple genomic datasets.

Sequencing results can be summarized as ‘region sets’: a list of genomic regions defined by specific coordinates in a reference genome. This is true for all frequently used epigenomic methods such as chromatin immuprecipation (ChIP)-seq, CUT&RUN, DNA-methylation profiling and chromosome conformation capture-based methods. Genome-sequencing studies provide information on mutations, copy number variations and inversions that can be expressed as region sets. Transcriptomic data can be converted in region sets for instance by extracting the genomic position of differentially expressed genes or their regulatory elements. All regions can be further associated with metadata further characterizing them. Any annotated feature of interest in the genome itself, such as repetitive elements, conserved regions, or regulatory elements, can be chosen as regions set for further analysis.

To interpret the meaning of overlaps between regions sets, in 2015 we published regioneR, an R package developed for the statistical evaluation of the association between sets of genomic regions. The main limitation of regioneR was that it can only perform pairwise association analysis. The resulting association value (“z-score”) of different pairwise analyses could not be directly compared since z-scores depend on the size and nature of the input region sets. When scaling up the number of pairwise comparisons of larger data sets, the required increase in calculation time becomes a serious limitation.

To address these issues, we now present regioneReloaded: an R package that is the logically designed evolution of regioneR allowing to calculate the statistical association between multiple regions sets at the same time. regionReloaded comprises a set of functions created to analyze clusters of associations and to facilitate the interpretation of results. Importantly, it includes strategies to improve p-value calculations and provides normalized z-scores allowing the direct comparison of multiple analysis. It also builds upon regioneR by adding new plotting functions for obtaining publication-ready graphs.

Taken together, regioneReloaded, aims to be a novel and valuable addition to the repertoire of available tools in the Bioconductor repository for the analysis of high-content genomic datasets.

1.1 regioneR and regioneReloaded

regioneR is an R package published in 2015 created to statistically test the positional association between genomic region sets. The core of regioneR is a permutation test framework specifically designed to work in a genomic environment. For a detailed insight into the permutation test strategy implemented in regioneR, including the different randomization and evaluation strategies, please see the regioneR vignette. The two main results that can be obtained with this statistical method can be summarized in two graphs. Figure 1A shows the association observed between the two region sets under study, highlighting the distance calculated in standard deviations from the random distribution. Figure 1B shows the local z-score: a narrow peak, as the one shown, indicates that the association is highly dependent on the exact location of the region. One the contrary, a flat profile would suggest a regional association.

(figure 1)

regioneR basic Graphs

1.2 Normalized Z-Score

regioneReloaded aims to integrate the framework previously developed for regioneR to calculate associations of different region sets simultaneously. However, the values of z-score obtained in different tests can not be directly compared. In particular, the value of the z-score is directly associated to the square root of the number of regions present in the first region set of the test (RS1).

To compare the z-scores obtained in multi pairwise tests between region sets, we introduce the concept of normalized z-score, which allows not only to directly compare different association events but also to work with subsets of data and speed up the calculations.

The normalized z-score is calculated as follows:

nZS = ZS / \(\sqrt{n}\)

We can empirically evaluate how the value of z-score between two well known strongly associated proteins, CTCF (ENCFF237OKO) and RAD21 (ENCFF072UEX) in HepG2 cell line from ENCODE portal, (Sloan et al. 2016, https://www.encodeproject.org/) , increases with the number of regions included in the test. On the other hand, the value of the normalized z-score is much more stable with different number of regions. Importantly, we see that the value of normalized z-score is stable approximately after including only 30% of the regions in the dataset. This demonstrates than sampling a portion of the dataset is a viable strategy to reduce calculation time without impacting the capacity to obtain meaningful normalized z-score values.

Normal z-score empirical calculations

Another crucial factor for the calculation of the normal z-score is the number of permutations performed in the test (ntimes). We suggest to use a minimum of 5000 permutations to achieve reproducible results.

Ntimes demonstration

2 Installation


if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("regioneReloaded")

3 Quick start

library("regioneReloaded")
#> Loading required package: regioneR
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb

The permutation test can be performed on a list multiple region sets by the function crosswisePermTest(). This process is computing intensive and its calculation time depends on the number of cores called by the parameter mc.cores (see regioneR). The result of the permutation test obtained by running the code below is included as a pre-computed example dataset which can be loaded into the environment by running the command data("cw_Alien_RaR").

AlienGenome and AlienRSList are toy datasets described in the next section of the vignette.

#NOT RUN
  
  set.seed(42)
  cw_Alien_ReG<-crosswisePermTest(Alist = AlienRSList_narrow,
                            sampling = FALSE,
                            mc.cores= 25,
                            ranFUN = "resampleGenome",
                            evFUN = "numOverlaps",
                            genome = AlienGenome,
                            ntimes= 1000
  )
#

Once the multiple permutation tests are performed, we can generate a matrix of pairwise associations between the region sets with the function makeCrosswiseMatrix() and visualize it with plotCrosswiseMatrix().


data("cw_Alien")

cw_Alien_ReG<-makeCrosswiseMatrix(cw_Alien_ReG, pvcut = 1)
#> [1] "method selected for hclustering: average"
#>  complete   average    single   ward.D2    median  centroid  mcquitty 
#> 0.7497654 0.7857181 0.6514804 0.7363304 0.3459672 0.4561003 0.7516609

plotCrosswiseMatrix(cw_Alien_ReG)