In this guide, we illustrate here two common downstream analysis workflows for ChIP-seq experiments, one is for comparing and combining peaks for single transcription factor (TF) with replicates, and the other is for comparing binding profiles from ChIP-seq experiments with multiple TFs.
This workflow shows how to convert BED/GFF files to GRanges, find overlapping peaks between two peak sets, and visualize the number of common and specific peaks with Venn diagram.
The input for ChIPpeakAnno1 is a list of called peaks identified from ChIP-seq experiments or any other experiments that yield a set of chromosome coordinates. Although peaks are represented as GRanges in ChIPpeakAnno, other common peak formats such as BED, GFF and MACS can be converted to GRanges easily using a conversion toGRanges
method. For detailed information on how to use this method, please type ?toGRanges
.
The following examples illustrate the usage of this method to convert BED and GFF file to GRanges, add metadata from orignal peaks to the overlap GRanges using function addMetadata
, and visualize the overlapping using function makeVennDiagram
.
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
## must keep the class exactly same as gr1$score, i.e., numeric.
gr2$score <- as.numeric(gr2$score)
ol <- findOverlapsOfPeaks(gr1, gr2)
## add metadata (mean of score) to the overlapping peaks
ol <- addMetadata(ol, colNames="score", FUN=mean)
ol$peaklist[["gr1///gr2"]][1:2]
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] chr1 713791-715578 * | gr1__MACS_peak_13,gr2__001,gr2__002
## [2] chr1 724851-727191 * | gr2__003,gr1__MACS_peak_14
## score
## <numeric>
## [1] 850.203
## [2] 29.170
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
makeVennDiagram(ol, fill=c("#009E73", "#F0E442"), # circle fill color
col=c("#D55E00", "#0072B2"), #circle border color
cat.col=c("#D55E00", "#0072B2")) # label color, keep same as circle border color
## INFO [2024-02-16 16:25:09] $fill
## INFO [2024-02-16 16:25:09] [1] "#009E73" "#F0E442"
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $col
## INFO [2024-02-16 16:25:09] [1] "#D55E00" "#0072B2"
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $cat.col
## INFO [2024-02-16 16:25:09] [1] "#D55E00" "#0072B2"
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $cat.cex
## INFO [2024-02-16 16:25:09] [1] 1
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $cat.fontface
## INFO [2024-02-16 16:25:09] [1] "plain"
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $cat.fontfamily
## INFO [2024-02-16 16:25:09] [1] "serif"
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $x
## INFO [2024-02-16 16:25:09] $x$gr1
## INFO [2024-02-16 16:25:09] [1] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## INFO [2024-02-16 16:25:09] [19] 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## INFO [2024-02-16 16:25:09] [37] 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## INFO [2024-02-16 16:25:09] [55] 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
## INFO [2024-02-16 16:25:09] [73] 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
## INFO [2024-02-16 16:25:09] [91] 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
## INFO [2024-02-16 16:25:09] [109] 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
## INFO [2024-02-16 16:25:09] [127] 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205
## INFO [2024-02-16 16:25:09] [145] 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
## INFO [2024-02-16 16:25:09] [163] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
## INFO [2024-02-16 16:25:09] [181] 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259
## INFO [2024-02-16 16:25:09] [199] 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
## INFO [2024-02-16 16:25:09] [217] 278 279 280 281 282 283 284 285 286 287 288 289
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $x$gr2
## INFO [2024-02-16 16:25:09] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## INFO [2024-02-16 16:25:09] [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## INFO [2024-02-16 16:25:09] [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## INFO [2024-02-16 16:25:09] [55] 55 56 57 58 59 60 61 124 125 126 127 128 129 130 131 132 133 134
## INFO [2024-02-16 16:25:09] [73] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
## INFO [2024-02-16 16:25:09] [91] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
## INFO [2024-02-16 16:25:09] [109] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
## INFO [2024-02-16 16:25:09] [127] 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
## INFO [2024-02-16 16:25:09] [145] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## INFO [2024-02-16 16:25:09] [163] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
## INFO [2024-02-16 16:25:09] [181] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## INFO [2024-02-16 16:25:09] [199] 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278
## INFO [2024-02-16 16:25:09] [217] 279 280 281 282 283 284 285 286 287 288 289
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $filename
## INFO [2024-02-16 16:25:09] NULL
## INFO [2024-02-16 16:25:09]
## INFO [2024-02-16 16:25:09] $disable.logging
## INFO [2024-02-16 16:25:09] [1] TRUE
## INFO [2024-02-16 16:25:09]
## $p.value
## gr1 gr2 pval
## [1,] 1 1 0
##
## $vennCounts
## gr1 gr2 Counts count.gr1 count.gr2
## [1,] 0 0 0 0 0
## [2,] 0 1 61 0 61
## [3,] 1 0 62 62 0
## [4,] 1 1 166 168 169
## attr(,"class")
## [1] "VennCounts"
Annotation data should be an object of GRanges. Same as import peaks, we use the method toGRanges
, which can return an object of GRanges, to represent the annotation data. An annotation data be constructed from not only BED, GFF or user defined readable text files, but also EnsDb or TxDb object, by calling the toGRanges
method. Please type ?toGRanges
for more information.
Note that the version of the annotation data must match with the genome used for mapping because the coordinates may differ for different genome releases. For example, if you are using Mus_musculus.v103 for mapping, you’d best also use EnsDb.Mmusculus.v103 for annotation. For more information about how to prepare the annotation data, please refer ?getAnnotation.
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 chr1 11869-14412 + | DDX11L1
## ENSG00000227232 chr1 14363-29806 - | WASH7P
## -------
## seqinfo: 273 sequences (1 circular) from 2 genomes (hg19, GRCh37)
After finding the overlapping peaks, the distribution of the distance of overlapped peaks to the nearest feature such as the transcription start sites (TSS) can be plotted by binOverFeature
function. The sample code here plots the distribution of peaks around the TSS.
overlaps <- ol$peaklist[["gr1///gr2"]]
binOverFeature(overlaps, annotationData=annoData,
radius=5000, nbins=20, FUN=length, errFun=0,
xlab="distance from TSS (bp)", ylab="count",
main="Distribution of aggregated peak numbers around TSS")
In addition, genomicElementDistribution
can be used to summarize the distribution of peaks over different type of features such as exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR. This distribution can be summarized in peak centric or nucleotide centric view using the function genomicElementDistribution
. Please note that one peak might span multiple type of features, leading to the number of annotated features greater than the total number of input peaks. At the peak centric view, precedence will dictate the annotation order when peaks span multiple type of features.
## check the genomic element distribution of the duplicates
## the genomic element distribution will indicates the
## the correlation between duplicates.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
peaks <- GRangesList(rep1=gr1,
rep2=gr2)
genomicElementDistribution(peaks,
TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene,
promoterRegion=c(upstream=2000, downstream=500),
geneDownstream=c(upstream=0, downstream=2000))
## check the genomic element distribution for the overlaps
## the genomic element distribution will indicates the
## the best methods for annotation.
## The percentages in the legend show the percentage of peaks in
## each category.
out <- genomicElementDistribution(overlaps,
TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene,
promoterRegion=c(upstream=2000, downstream=500),
geneDownstream=c(upstream=0, downstream=2000),
promoterLevel=list(
# from 5' -> 3', fixed precedence 3' -> 5'
breaks = c(-2000, -1000, -500, 0, 500),
labels = c("upstream 1-2Kb", "upstream 0.5-1Kb",
"upstream <500b", "TSS - 500b"),
colors = c("#FFE5CC", "#FFCA99",
"#FFAD65", "#FF8E32")))
## check the genomic element distribution by upset plot.
## by function genomicElementUpSetR, no precedence will be considered.
library(UpSetR)
x <- genomicElementUpSetR(overlaps,
TxDb.Hsapiens.UCSC.hg19.knownGene)
upset(x$plotData, nsets=13, nintersects=NA)
Metagene plot may also provide information for annotation.
metagenePlot(peaks, TxDb.Hsapiens.UCSC.hg19.knownGene)