Introduction

In this vignette, we demonstrate the block bootstrap functionality implemented in nullranges. See the main nullranges vignette for an overview of the idea of bootstrapping, or the diagram below.

nullranges contains an implementation of a block bootstrap for genomic data, as proposed by Bickel et al. (2010), in which features (ranges) are sampled from the genome in blocks. The original block bootstrap algorithm for genomic data is implemented in a Python package called Genome Structure Correlation (GSC).

The bootRanges methods are described in Mu et al. (2023).

Quick start

Minimal code for running bootRanges() is shown below. Genome segmentation seg and excluded regions exclude are optional.

library(nullranges)
library(ExperimentHub)
library(AnnotationHub)
eh <- ExperimentHub()
ah <- AnnotationHub()
seg <- eh[["EH7307"]] # genome segmentation for hg38
exclude <- ah[["AH107305"]] # Kundaje excluded regions for hg38, see below
set.seed(5) # set seed for reproducibility
blockLength <- 5e5 # size of blocks to bootstrap
R <- 10 # number of iterations of the bootstrap
# `ranges` is the GRanges object of features to be bootstrapped
boots <- bootRanges(ranges, blockLength=blockLength, R=R,
                    seg=seg, exclude=exclude)
# `boots` can then be used with plyranges commands

Method overview

bootRanges() implements several algorithms, both segmented and unsegmented; in the segmented version, blocks are sampled with respect to a particular genome segmentation. Overall, we recommend the segmented block bootstrap, given the heterogeneity of structure across the genome. If the purpose is to block bootstrap ranges within a smaller set of sequences, such as motifs within transcript sequences, then the unsegmented algorithm is sufficient.

In a segmented block bootstrap, the blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states \(1,2, \dots, S\), blocks from state \(s\) are used to tile the ranges of state \(s\) in each bootstrap sample. The process is visualized in (A): a block of length \(L_b\) is randomly sampled with replacement from state “red”, and the features (ranges) that overlap this block are then copied to the first tile (which is in the “red” state). Sampling is allowed across chromosomes (as shown here), as long as the two blocks are in the same state.
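The tiling idea described above can be sketched in base R on a toy example. This is only an illustration under simplifying assumptions, not the algorithm as implemented in bootRanges(): the 120 bp "genome", the two-state segmentation `seg`, the point-like `features`, and the helpers `state_at()` and `bootstrap_once()` are all hypothetical, and blocks are required to lie fully inside a region of the matching state.

```r
# Toy segmented block bootstrap (illustrative sketch, not the nullranges code)
set.seed(1)

# Hypothetical segmentation of a 120 bp genome into two states
seg <- data.frame(start = c(1, 41, 81),
                  end   = c(40, 80, 120),
                  state = c("red", "blue", "red"))

# Hypothetical features, reduced to single positions for simplicity
features <- c(3, 8, 15, 22, 37, 45, 70, 85, 90, 99, 118)

Lb <- 10  # block length

# Look up the segmentation state at each position
state_at <- function(pos) seg$state[findInterval(pos, seg$start)]

bootstrap_once <- function() {
  pieces <- list()
  for (i in seq_len(nrow(seg))) {
    # Candidate block starts: blocks lying fully inside a region of the same state
    same <- seg[seg$state == seg$state[i], ]
    cand <- unlist(Map(function(s, e) seq(s, e - Lb + 1), same$start, same$end))
    # Tile region i with blocks of length Lb
    for (tile_start in seq(seg$start[i], seg$end[i], by = Lb)) {
      # Sample a block with replacement from the matching state
      block_start <- cand[sample.int(length(cand), 1)]
      hit <- features[features >= block_start & features < block_start + Lb]
      # Copy the features in the block into the tile, keeping relative offsets
      pieces[[length(pieces) + 1]] <- data.frame(
        pos = hit - block_start + tile_start, origin = hit)
    }
  }
  do.call(rbind, pieces)
}

boot <- bootstrap_once()
```

Each row of `boot` records a bootstrapped position and the original feature it was copied from; by construction, both lie in regions of the same state, even when they come from different regions.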

An example workflow of bootRanges() used in combination with plyranges (Lee, Cook, and Lawrence 2019) is diagrammed in (B), and can be summarized as:

  1. Compute statistics of interest between GRanges of feature \(x\) and GRanges of feature \(y\) to assess association in the original data. This could be an enrichment (amount of overlap) or another statistic that makes use of covariates associated with each range.
  2. Run bootRanges() with optional arguments seg (segmentation) and exclude (excluded regions as compiled by Ogata et al. (2023)) to create a BootRanges object (\(y'\)).
  3. Compute the bootstrap distribution of test statistics between GRanges of feature \(x\) and \(y'\) using plyranges.
  4. Perform a bootstrap p-value or \(z\) test of the null hypothesis that there is no true biological enrichment in the original data (that is, that bootstrap data often show enrichment as high as that of the observed data).
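The final step of the workflow can be sketched in base R as follows. This is a minimal illustration, assuming the observed statistic and the per-iteration bootstrap statistics have already been computed; the values of `obs_stat` and `boot_stats` are simulated stand-ins, and the "+1" correction in the empirical p-value is one common convention rather than something prescribed by the text above.

```r
# Bootstrap p-value and z test from a vector of bootstrap statistics
# (illustrative sketch with simulated stand-in values)
set.seed(1)

obs_stat   <- 42                        # hypothetical observed statistic, e.g. an overlap count
boot_stats <- rpois(1000, lambda = 30)  # stand-in for one statistic per bootstrap iteration

# Empirical one-sided p-value: fraction of bootstrap statistics at least as
# large as the observed one, with a +1 correction to avoid a p-value of zero
p_emp <- (1 + sum(boot_stats >= obs_stat)) / (1 + length(boot_stats))

# z score of the observed statistic relative to the bootstrap distribution,
# and the corresponding one-sided p-value under a normal approximation
z   <- (obs_stat - mean(boot_stats)) / sd(boot_stats)
p_z <- pnorm(z, lower.tail = FALSE)
```

The empirical p-value makes no distributional assumption but cannot be smaller than 1 / (R + 1), whereas the \(z\) test relies on the bootstrap distribution being approximately normal.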