In this vignette, we demonstrate the block bootstrap functionality implemented in nullranges. See the main nullranges vignette for an overview of the idea of bootstrapping, or the diagram below.
nullranges contains an implementation of a block bootstrap for genomic data, as proposed by Bickel et al. (2010), such that features (ranges) are sampled from the genome in blocks. The original block bootstrapping algorithm for genomic data is implemented in a python software called Genome Structure Correlation, GSC.
Our description of the bootRanges methods is described in Mu et al. (2023).
Minimal code for running bootRanges()
is shown below. Genome segmentation seg
and excluded regions exclude
are optional.
eh <- ExperimentHub()
ah <- AnnotationHub()
# some default resources:
seg <- eh[["EH7307"]] # pre-built genome segmentation for hg38
exclude <- ah[["AH107305"]] # Kundaje excluded regions for hg38, see below
set.seed(5) # set seed for reproducibility
blockLength <- 5e5 # size of blocks to bootstrap
R <- 10 # number of iterations of the bootstrap
# input `ranges` require seqlengths, if missing see `GenomeInfoDb::Seqinfo`
seqlengths(ranges)
# next, we remove non-standard chromosomes ...
ranges <- keepStandardChromosomes(ranges, pruning.mode="coarse")
# ... and mitochondrial genome, as these are too short
seqlevels(ranges, pruning.mode="coarse") <- setdiff(seqlevels(ranges), "MT")
# generate bootstraps
boots <- bootRanges(ranges, blockLength=blockLength, R=R,
seg=seg, exclude=exclude)
# `boots` can then be used with plyranges commands, e.g. join_overlap_*
The boots
object will contain a column, iter
which marks the different bootstrap samples that were generated. This allows for tidy analysis with plyranges, e.g. counting the number of overlapping ranges, per bootstrap iteration. For more examples of combining bootRanges
with plyranges operations, see the tidy ranges tutorial.
Several algorithms are implemented in bootRanges()
, including a segmented and unsegmented version, where in the former, blocks are sampled with respect to a particular genome segmentation. Overall, we recommend the segmented block bootstrap given the heterogeneity of structure across the entire genome. If the purpose is block bootstrapping ranges within a smaller set of sequences, such as motifs within transcript sequence, then the unsegmented algorithm would be sufficient.
In a segmented block bootstrap, the blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states \(1,2, \dots, S\), only blocks from state s will be used to sample the ranges of state s in each bootstrap sample. The process can be visualized in diagram panel (A) below, where a block with length \(L_b\) is randomly sampled with replacement from state “red” and the features (ranges) that overlap this block are then copied to the first tile (which is in the “red” state). The sampling is allowed across chromosome (as shown here), as long as the two blocks are in the same state.
Note that nullranges provides both functions for generating a genome segmentation from e.g. gene density, described below, as well as a default segmentation for hg38 that can be used directly.
An example workflow of bootRanges()
used in combination with plyranges (Lee, Cook, and Lawrence 2019) is diagrammed in panel (B) below, and can be summarized as:
bootRanges()
with optional arguments seg
(segmentation) and exclude
(excluded regions as compiled by Ogata et al. (2023)) to create a BootRanges object (\(y'\))