Contents

0.1 Peak Annotation

Q: How to import peaks?

A: Function toGRanges is designed to import peak files. Users can also import the peaks by rtracklayer::import method.

Q: How to properly annotate the peaks from ChIP-seq data?

A: Multiple choices are provided in function annotatePeakInBatch with the output argument.

Q: How to prepare the AnnotationData for annotatePeakInBatch?

A: To prepare the lastest released annotations of the assembly will help user getting accurate annotations. ChIPpeakAnno can use multiple sources as AnnotationData. toGRanges can convert TxDb and EnsDb to AnnotationData. getAnnotation will retrieve annotation data from bioMart. AnnotationData could also be a user-defined list in GenomicRanges::GRanges format, e.g., another list of peaks.

Q: If we use annoPeaks to annotate the start site of features, why the number of annotated peaks sometimes are smaller than we do findOverlaps for the peaks and features?

A: annoPeaks will make sure the downstream annotation range for startSite and upstream annotation range for endSite are within the features.

0.2 Find Overlaps of Peaks

Q: Why is the sum of peak numbers in the Venn-diagram not equal to the sum of the numbers in the original peaks list?

A: This question is a very typical one for calculating intersection of peaks. As you know that a peak is a range of continuous points instead of a single point. If we consider intersection of set A (1-2, 4-5, 7-9) with set B (2-8), how many peaks should we output as the overlapping set, ie., 1 or 3? It would be 1 if we uses B as the reference set, and 3 if we use A as the reference set.
In ChIPpeakAnno (release version), by default, we are using the minimal number for intersect peaks, ie., 1 in this case.

Q: Why is the peak number in the Venn-digram not equal to the length of peaklist in the output of findOverlapsOfPeaks?

A: There is an argument called connectedPeaks. In the documentation, we described the argument connectedPeaks as If multiple peaks involved in overlapping in several groups, set it to “merge” will count it as 1, while set it to “min” will count it as the minimal involved peaks in any group of connected/overlapped peaks. The default is “min”. By default, the program will select the minimal number from each peaklist involved in one merged peak. So the number will be no less than the number in the peaklist. If user set connectedPeaks to merge, the number will be exactly the same as the number in the peaklist. Here is a simple example to understand the difference between “min” and “merge”:

p1 <- GRanges("1", IRanges(c(1, 4, 7), width=2))
p2 <- GRanges("1", IRanges(c(2, 5), width=3))
ol_min <- findOverlapsOfPeaks(p1, p2, connectedPeaks="min") 
## the counts will be the minimal peaks involved in that group of 
## connected peaks, so you get 2.
ol_merge <- findOverlapsOfPeaks(p1, p2, connectedPeaks="merge") 
## the counts will be 1 for each group of connected peaks
ol_min$venn_cnt
ol_merge$venn_cnt

Q: Is there a way to show the number of peaks in original peak list?

A: Try to set connectedPeaks=“keepAll” in findOverlapsOfPeaks and makeVennDiagram.

Q: How to extract the original peak IDs of the overlapping peaks?

A: In the output of findOverlapsOfPeaks, there is a column in the metadata of each element in peaklist, called peakNames, which is a CharacterList. The CharacterList is a list of the contributing peak ids with prefix, eg. peaks1__peakname1, peaksi__peaknamej. Users can access the original peak name by split those characters. Here is the sample code:

library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
peakNames <- ol$peaklist[['gr1///gr2']]$peakNames
library(reshape2)
peakNames1 <- melt(peakNames, value.name="merged.peak.id")
peakNames1 <- cbind(peakNames1[, 1], do.call(rbind, strsplit(as.character(peakNames1[, 3]), "__")))
colnames(peakNames1) <- c("merged.peak.id", "group", "peakName")
head(peakNames1)
gr1.subset <- gr1[peakNames1[peakNames1[, "group"] %in% "gr1", "peakName"]]
gr2.subset <- gr2[peakNames1[peakNames1[, "group"] %in% "gr2", "peakName"]]

Here is an alternative way to access the original peak IDs.

all.peaks <- ol$all.peaks
gr1.renamed <- all.peaks$gr1
gr2.renamed <- all.peaks$gr2
peakNames <- melt(ol$peaklist[['gr1///gr2']]$peakNames, value.name="merged.peak.id")
gr1.sub <- gr1.renamed[peakNames[grepl("^gr1", peakNames[, 3]), 3]]
gr2.sub <- gr2.renamed[peakNames[grepl("^gr2", peakNames[, 3]), 3]]

Q: How to select the proper number for totalTest in the function makeVennDiagram?

A: When we test the association between two sets of data based on hypergeometric distribution, the number of all potential binding sites is required. The parameter totalTest in the function makeVennDiagram indicates how many potential peaks in total will be used in the hypergeometric test. It should be larger than the largest number of peaks in the peak list. The smaller it is set, the more stringent the test is. The time used to calculate p-value does not depend on the value of the totalTest. For practical guidance on how to choose totalTest, please refer to the post.