CHiCAGO is a method for detecting statistically significant interaction events in Capture HiC data. This vignette will walk you through a typical CHiCAGO analysis.
A typical CHiCAGO job for two biological replicates of CHi-C data takes 2-3 hours of wall-clock time (including sample pre-processing from BAM files) and uses approximately 50 GB of RAM.
NOTE | A wrapper to perform this type of analysis, called runChicago.R, is provided as part of chicagoTools, which is available from our Bitbucket repository. Refer to the chicagoTools README for more information. |
The statistical foundations of CHiCAGO are presented in a separate paper that is currently available as a preprint (Jonathan Cairns*, Paula Freire-Pritchett* et al. 2015). Briefly, CHiCAGO uses a convolution background model accounting for both ‘Brownian collisions’ between fragments (distance-dependent) and ‘technical noise’. It borrows information across interactions (with appropriate normalisation) to estimate these background components separately on different subsets of data. CHiCAGO then uses a p-value weighting procedure based on the expected true positive rates at different distance ranges (estimated from data), with scores representing soft-thresholded -log weighted p-values. The score threshold of 5 is a suggested stringent score threshold for calling significant interactions.
WARNING | The data set used in this tutorial comes from the package PCHiCdata. This package contains small parts (two chromosomes each) of published Promoter Capture HiC data sets in mouse ESCs (Schoenfelder et al. 2015) and in GM12878 cells, a human lymphoblastoid cell line (Mifsud et al. 2015). Note that both papers used a different interaction-calling algorithm; we only reuse their raw data here. Do not use these sample input data for purposes other than training. |
In this vignette, we use the GM12878 data (Mifsud et al. 2015):
library(Chicago)
library(PCHiCdata)
Before you start, you will need the design files: a restriction map file (.rmap), a bait map file (.baitmap), and the three auxiliary files (.npb, .nbpb and .poe) that are generated from them by makeDesignFiles.py from chicagoTools at our Bitbucket repository. Refer to the chicagoTools README file for more details. We recommend that you put all five of these files into the same directory (that we refer to as designDir). An example of a valid design folder, for a two-chromosome sample of the GM12878 data used in this vignette, is provided in the PCHiCdata package, as follows.
dataPath <- system.file("extdata", package="PCHiCdata")
testDesignDir <- file.path(dataPath, "hg19TestDesign")
dir(testDesignDir)
## [1] "h19_chr20and21.baitmap" "h19_chr20and21.nbpb"
## [3] "h19_chr20and21.npb" "h19_chr20and21.poe"
## [5] "h19_chr20and21.rmap"
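As a quick sanity check, the two base design files can be read directly into R. This is only a sketch: it assumes the standard tab-separated layouts described in the chicagoTools README (four columns for the .rmap file, five for the .baitmap file), and the column names below are ours.

rmap <- read.table(file.path(testDesignDir, "h19_chr20and21.rmap"),
                   col.names=c("chr", "start", "end", "fragID"))
baitmap <- read.table(file.path(testDesignDir, "h19_chr20and21.baitmap"),
                      col.names=c("chr", "start", "end", "fragID", "baitAnnotation"))
head(rmap)
head(baitmap)
# Every bait fragment should also appear in the restriction map
stopifnot(all(baitmap$fragID %in% rmap$fragID))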
NOTE | Though we talk about “restriction fragments” throughout, any non-overlapping regions can be defined in .rmap (with a subset of these defined as baits). For example, if one wanted to increase power at the cost of resolution, it is possible to use bins of restriction fragments either throughout, or for some selected regions. However, in the binned fragment framework, we advise setting removeAdjacent to FALSE - see ?setExperiment for details on how to do this. |
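For instance, a minimal sketch of creating an experiment with removeAdjacent switched off might look like the following; the settings list argument used here is an assumption on our part, so consult ?setExperiment for the authoritative interface.

# Sketch: switch off the removal of adjacent fragments when "baits" are in
# fact bins of several restriction fragments (see ?setExperiment)
cdBinned <- setExperiment(designDir = testDesignDir,
                          settings = list(removeAdjacent = FALSE))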
You will also need input data files in .chinput format, one per biological replicate, generated from aligned BAM files with bam2chicago.sh, available as part of chicagoTools. (To obtain BAM files from raw fastq files, use a Hi-C alignment & QC pipeline such as HiCUP.) Example .chinput files are provided in the PCHiCdata package, as follows:
testDataPath <- file.path(dataPath, "GMchinputFiles")
dir(testDataPath)
## [1] "GM_rep1.chinput" "GM_rep2.chinput" "GM_rep3.chinput"
files <- c(
file.path(testDataPath, "GM_rep1.chinput"),
file.path(testDataPath, "GM_rep2.chinput"),
file.path(testDataPath, "GM_rep3.chinput")
)
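Before merging, you can peek at the raw contents of one replicate. The example files ship as plain text, so readLines() is enough for a quick look; readAndMerge(), used below, is the supported way to actually load them.

# Show the first few lines of one .chinput file, as produced by bam2chicago.sh
cat(readLines(files[1], n = 5), sep = "\n")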
OPTIONAL: The data set in this vignette requires some additional custom settings, both to ensure that the vignette compiles in a reasonable time and to deal with the artificially reduced size of the data set:
settingsFile <- file.path(system.file("extdata", package="PCHiCdata"),
"sGM12878Settings", "sGM12878.settingsFile")
Omit this step unless you know which settings you wish to alter. If you are using non-human samples, or a very unusual cell type, one set of options that you might want to change is the weighting parameters: see Using different weights.
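If you do use one, it is easy to see what it changes: settings files are plain text, with one setting per line, so printing the example file shows which defaults it overrides for this reduced two-chromosome data set.

# Print the contents of the example settings file
cat(readLines(settingsFile), sep = "\n")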
We run CHiCAGO on the test data as follows. First, we create a blank chicagoData object and tell it where the design files are. For this example, we also provide the optional settings file; in your own analysis, you can omit the settingsFile argument.
library(Chicago)
cd <- setExperiment(designDir = testDesignDir, settingsFile = settingsFile)
The properties of chicagoData objects are discussed further in The chicagoData object.
Next, we read in the input data files:
cd <- readAndMerge(files=files, cd=cd)
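At this point it can be helpful to look inside the object. The slot names used below (x for the merged per-fragment-pair data table, alongside params and settings) follow the description in The chicagoData object; only x is inspected here.

# The merged input data are stored as a data.table in the x slot
class(cd@x)
head(cd@x)
nrow(cd@x)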
Finally, we run the pipeline with chicagoPipeline():
cd <- chicagoPipeline(cd)
chicagoPipeline() produces a number of plots. You can save these to disk by setting the outprefix argument of chicagoPipeline().
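For example, the call above could be re-run as follows to write the diagnostic plots to files named after the given prefix (the exact file names are chosen by the function):

# Save the diagnostic plots under the given prefix
cd <- chicagoPipeline(cd, outprefix = file.path(tempdir(), "vignetteDiagnostics"))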
Here, we describe the expected properties of the diagnostic plots produced by chicagoPipeline(). Note that in this vignette they are computed on the fly, using only a small fraction of the full GM12878 dataset; in real-world, genome-wide datasets, more fragment pools will be visible and the trends described below should therefore be more pronounced.
One of these plots shows the number of reads for an average bait, as a function of distance, plotted on a log-log scale.
Two main output methods are provided. Here, we discuss the first: exporting to disk. However, it is also possible to export to a GenomicInteractions object, as described in Further downstream analysis.
You can export the results to disk using exportResults(). (If you use runChicago.R, the files appear in ./<results-folder>/data.) By default, the function outputs three different file formats:
outputDirectory <- tempdir()
exportResults(cd, file.path(outputDirectory, "vignetteOutput"))
## Reading the restriction map file...
## Reading the bait map file...
## Preparing the output table...
## Writing out for seqMonk...
## Writing out interBed...
## Preprocessing for WashU outputs...
## Writing out text file for WashU browser upload...
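If you only need a subset of these formats, exportResults() can be told which files to write. The format argument and the "washU_text" value used below are assumptions on our part; check ?exportResults for the authoritative interface.

# Sketch: write only the washU text file, keeping the default score threshold
exportResults(cd, file.path(outputDirectory, "vignetteOutputWashU"),
              format = "washU_text")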
Each called interaction is assigned a score that represents how strong CHiCAGO believes the interaction is (formally, it is based on -log(adjusted P-value)). Thus, a larger score represents a stronger interaction. In each case, the score threshold of 5 is applied.
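The scores are also available directly in the processed chicagoData object, so you can gauge how many fragment pairs pass a given threshold without exporting anything. This assumes the x slot and score column naming described in The chicagoData object.

# Number of fragment pairs passing the stringent (5) and lenient (3) thresholds
sum(cd@x$score >= 5, na.rm = TRUE)
sum(cd@x$score >= 3, na.rm = TRUE)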
Summary of output files. Three files are produced: an interBed (.ibed) file with one interaction per row; a seqMonk-format file, in which each interaction occupies two consecutive rows (the bait fragment, then the other end); and a washU text file with one interaction per row, given as a pair of comma-separated coordinates plus the score. The first few rows of each file are shown below, in that order:
## bait_chr bait_start bait_end bait_name otherEnd_chr
## 1 20 119103 138049 DEFB126 20
## 2 20 119103 138049 DEFB126 20
## 3 20 161620 170741 DEFB128 20
## 4 20 233983 239479 DEFB132 20
## 5 20 268639 284501 AL034548.1;C20orf96;ZCCHC3 20
## 6 20 268639 284501 AL034548.1;C20orf96;ZCCHC3 20
## otherEnd_start otherEnd_end otherEnd_name N_reads score
## 1 161620 170741 DEFB128 11 5.10
## 2 523682 536237 CSNK2A1 6 6.79
## 3 73978 76092 . 16 5.14
## 4 206075 209203 DEFB129 33 6.02
## 5 293143 304037 . 34 7.44
## 6 304038 305698 . 34 9.04
## V1 V2 V3 V4 V5 V6
## 1 20 119103 138049 DEFB126 11 5.10
## 2 20 161620 170741 DEFB128 11 5.10
## 3 20 119103 138049 DEFB126 6 6.79
## 4 20 523682 536237 CSNK2A1 6 6.79
## 5 20 161620 170741 DEFB128 16 5.14
## 6 20 73978 76092 . 16 5.14
## V1 V2 V3
## 1 chr20,119103,138049 chr20,161620,170741 5.10
## 2 chr20,119103,138049 chr20,523682,536237 6.79
## 3 chr20,161620,170741 chr20,73978,76092 5.14
## 4 chr20,233983,239479 chr20,206075,209203 6.02
## 5 chr20,268639,284501 chr20,293143,304037 7.44
## 6 chr20,268639,284501 chr20,304038,305698 9.04
These are the first few rows of each of the files produced by exportResults(). For bait-to-bait interactions, the interaction can be tested either way round (i.e. either fragment can be considered the “bait”). In most output formats, both of these tests are preserved. The exception is the washU output, where the two scores are consolidated by taking the maximum.
NOTE | When comparing interactions detected between multiple replicates, the degree of overlap may appear to be lower than expected. This is due to the undersampled nature of most CHi-C datasets. Sampling error can drive down the sensitivity, particularly for interactions that span large distances and have low read counts. As such, low overlap is not necessarily an indication of a high false discovery rate. Undersampling needs to be taken into consideration when interpreting CHiCAGO results. In particular, we recommend performing comparisons at the score-level rather than at the level of thresholded interaction calls. Potentially, differential analysis algorithms for sequencing data such as DESeq2 (Love, Huber, and Anders 2014) may also be used to formally compare the enrichment at CHiCAGO-detected interactions between conditions at the count level, although power will generally be a limiting factor. Formal methods such as sdef (Blangiardo, Cassese, and Richardson 2010) may provide a more balanced view of the consistency between replicates. Alternatively, additional filtering based on the mean number of reads per detected interaction (e.g. removing calls with N<10 reads) will reduce the impact of undersampling on the observed overlap, but at the cost of decreasing the power to detect longer-range interactions. |
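As a concrete sketch of the read-count filter suggested above (the score and N columns of cd@x are assumed to hold the final scores and merged read counts, respectively):

# Keep only significant calls supported by at least 10 reads; this reduces the
# impact of undersampling, at the cost of power for longer-range interactions
robustCalls <- cd@x[score >= 5 & N >= 10]
nrow(robustCalls)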
The plotBaits() function can be used to plot the raw read counts versus linear distance from the bait, for either specific or random baits, labelling significant interactions in a different colour. By default, 16 random baits are plotted, with interactions within 1 Mb of the bait passing the score threshold of 5 shown in red and those passing the more lenient threshold of 3 shown in blue.
plottedBaitIDs <- plotBaits(cd, n=6)
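plotBaits() returns the IDs of the baits it plotted, so the same baits can be revisited later. The baits argument used below (a vector of bait fragment IDs) is an assumption on our part; see ?plotBaits.

# Re-plot the first two of the previously plotted baits by their IDs
plotBaits(cd, baits = plottedBaitIDs[1:2])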