Introduction

In this vignette, we demonstrate the block bootstrap functionality implemented in nullranges. See the main nullranges vignette for an overview of the idea of bootstrapping, or the diagram below.

nullranges contains an implementation of a block bootstrap for genomic data, as proposed by Bickel et al. (2010), such that features (ranges) are sampled from the genome in blocks. The original block bootstrapping algorithm for genomic data is implemented in a python software called Genome Structure Correlation, GSC.

Our description of the bootRanges methods is described in Mu et al. (2023).

Quick start

Minimal code for running bootRanges() is shown below. Genome segmentation seg and excluded regions exclude are optional.

eh <- ExperimentHub()
ah <- AnnotationHub()
# some default resources:
seg <- eh[["EH7307"]] # pre-built genome segmentation for hg38
exclude <- ah[["AH107305"]] # Kundaje excluded regions for hg38, see below

set.seed(5) # set seed for reproducibility
blockLength <- 5e5 # size of blocks to bootstrap
R <- 10 # number of iterations of the bootstrap

# input `ranges` require seqlengths, if missing see `GenomeInfoDb::Seqinfo`
seqlengths(ranges)
# next, we remove non-standard chromosomes ...
ranges <- keepStandardChromosomes(ranges, pruning.mode="coarse")
# ... and mitochondrial genome, as these are too short
seqlevels(ranges, pruning.mode="coarse") <- setdiff(seqlevels(ranges), "MT")

# generate bootstraps
boots <- bootRanges(ranges, blockLength=blockLength, R=R,
                    seg=seg, exclude=exclude)
# `boots` can then be used with plyranges commands, e.g. join_overlap_*

The boots object will contain a column, iter which marks the different bootstrap samples that were generated. This allows for tidy analysis with plyranges, e.g. counting the number of overlapping ranges, per bootstrap iteration. For more examples of combining bootRanges with plyranges operations, see the tidy ranges tutorial.

Method overview

Several algorithms are implemented in bootRanges(), including a segmented and unsegmented version, where in the former, blocks are sampled with respect to a particular genome segmentation. Overall, we recommend the segmented block bootstrap given the heterogeneity of structure across the entire genome. If the purpose is block bootstrapping ranges within a smaller set of sequences, such as motifs within transcript sequence, then the unsegmented algorithm would be sufficient.

In a segmented block bootstrap, the blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states \(1,2, \dots, S\), only blocks from state s will be used to sample the ranges of state s in each bootstrap sample. The process can be visualized in diagram panel (A) below, where a block with length \(L_b\) is randomly sampled with replacement from state “red” and the features (ranges) that overlap this block are then copied to the first tile (which is in the “red” state). The sampling is allowed across chromosome (as shown here), as long as the two blocks are in the same state.

Note that nullranges provides both functions for generating a genome segmentation from e.g. gene density, described below, as well as a default segmentation for hg38 that can be used directly.

An example workflow of bootRanges() used in combination with plyranges (Lee, Cook, and Lawrence 2019) is diagrammed in panel (B) below, and can be summarized as:

  1. Compute statistics of interest between GRanges of feature \(x\) and GRanges of feature \(y\) to assess association in the original data. This could be an enrichment (amount of overlap) or other possible statistics making use of covariates associated with each range
  2. Generate bootstrap samples of \(y\): bootRanges() with optional arguments seg (segmentation) and exclude (excluded regions as compiled by Ogata et al. (2023)) to create a BootRanges object (\(y'\))
  3. Compute the bootstrap distribution of test statistics between GRanges of feature \(x\) and \(y'\) using plyranges (compute overlaps of all features in \(x\) with all features in \(y'\), grouping by the bootstrap sample)
  4. Conpute a bootstrap p-value or \(z\) test to test the null hypothesis that there is no association between \(x\) and \(y\) (e.g. that the bootstrap data often has as high an enrichment as the observed data)