Abstract

ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.

Citation

If you use ChIPseeker(Yu, Wang, and He 2015) in published research, please cite:

Introduction

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrating expression profile data to predict gene expression regulation. ChIPseeker(Yu, Wang, and He 2015) was developed for annotating nearest genes and genomic features to peaks.

ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker(Yu, Wang, and He 2015) support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species.

Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region.

Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE(Yu et al. 2015), ReactomePA(Yu and He 2016), clusterProfiler(Yu et al. 2012).

## loading packages
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)

ChIP profiling

The datasets CBX6 and CBX7 in this vignettes were downloaded from GEO (GSE40740)(Pemberton et al. 2014) while ARmo_0M, ARmo_1nM and ARmo_100nM were downloaded from GEO (GSE48308)(Urbanucci et al. 2012) . ChIPseeker provides readPeakFile to load the peak and store in GRanges object.

files <- getSampleFiles()
print(files)
## $ARmo_0M
## [1] "/tmp/RtmpmvoXgR/Rinst23e31e3711bd5b/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
## 
## $ARmo_1nM
## [1] "/tmp/RtmpmvoXgR/Rinst23e31e3711bd5b/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
## 
## $ARmo_100nM
## [1] "/tmp/RtmpmvoXgR/Rinst23e31e3711bd5b/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
## 
## $CBX6_BF
## [1] "/tmp/RtmpmvoXgR/Rinst23e31e3711bd5b/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
## 
## $CBX7_BF
## [1] "/tmp/RtmpmvoXgR/Rinst23e31e3711bd5b/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
peak <- readPeakFile(files[[4]])
peak
## GRanges object with 1331 ranges and 2 metadata columns:
##          seqnames              ranges strand |             V4        V5
##             <Rle>           <IRanges>  <Rle> |    <character> <numeric>
##      [1]     chr1       815093-817883      * |    MACS_peak_1    295.76
##      [2]     chr1     1243288-1244338      * |    MACS_peak_2     63.19
##      [3]     chr1     2979977-2981228      * |    MACS_peak_3    100.16
##      [4]     chr1     3566182-3567876      * |    MACS_peak_4    558.89
##      [5]     chr1     3816546-3818111      * |    MACS_peak_5     57.57
##      ...      ...                 ...    ... .            ...       ...
##   [1327]     chrX 135244783-135245821      * | MACS_peak_1327     55.54
##   [1328]     chrX 139171964-139173506      * | MACS_peak_1328    270.19
##   [1329]     chrX 139583954-139586126      * | MACS_peak_1329    918.73
##   [1330]     chrX 139592002-139593238      * | MACS_peak_1330    210.88
##   [1331]     chrY   13845134-13845777      * | MACS_peak_1331     58.39
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

ChIP peaks coverage plot

After peak calling, we would like to know the peak locations over the whole genome, covplot function calculates the coverage of peak regions over chromosomes and generate a figure to visualize. GRangesList is also supported and can be used to compare coverage of multiple bed files.

covplot(peak, weightCol="V5")

covplot(peak, weightCol="V5", chrs=c("chr17", "chr18"), xlim=c(4.5e7, 5e7))

Profile of ChIP peaks binding to TSS regions

First of all, for calculating the profile of ChIP peaks binding to TSS regions, we should prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then align the peaks that are mapping to these regions, and generate the tagMatrix.

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrix <- getTagMatrix(peak, windows=promoter)
##
## to speed up the compilation of this vignettes, we use a precalculated tagMatrix
data("tagMatrixList")
tagMatrix <- tagMatrixList[[4]]

In the above code, you should notice that tagMatrix is not restricted to TSS regions. The regions can be other types that defined by the user. ChIPseeker expanded the scope of region. Users can input the type and by parameters to get the regions they want.

Heatmap of ChIP binding to TSS regions

tagHeatmap(tagMatrix)
Heatmap of ChIP peaks binding to TSS regions

Heatmap of ChIP peaks binding to TSS regions

ChIPseeker provide a one step function to generate this figure from bed file. The following function will generate the same figure as above.

peakHeatmap(files[[4]], TxDb=txdb, upstream=3000, downstream=3000)

Users can use nbin parameter to speed up.

peakHeatmap(files[[4]],TxDb = txdb,nbin = 800,upstream=3000, downstream=3000)

Users can also use ggplot method to change the details of the figures.

peakHeatmap(files[[4]],TxDb = txdb,nbin = 800,upstream=3000, downstream=3000) +
  scale_fill_distiller(palette = "RdYlGn")

Users can also profile genebody regions with peakHeatmap().

peakHeatmap(peak = files[[4]],
            TxDb = txdb,
            upstream = rel(0.2),
            downstream = rel(0.2),
            by = "gene",
            type = "body",
            nbin = 800)
Heatmap of genebody regions

Heatmap of genebody regions

Sometimes there will be a need to explore the comparison of the peak heatmap over two regions, for example, the following picture is the peak over two gene sets. One possible scenery of using this method is to compare the peak heatmap over up-regulating genes and down-regulating genes. Here txdb1 and txdb2 is the simulated gene sets obtain from TxDb.Hsapiens.UCSC.hg19.knownGene. Using peakHeatmap_multiple_Sets(), accepting list object containing different regions information. The length of each part is correlated to the amount of regions.

txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))[1:10000,]

region_list <- list(geneX = txdb1, geneY = txdb2)
peakHeatmap_multiple_Sets(peak = files[[4]],
                          upstream = 1000,downstream = 1000,
                          by = c("geneX","geneY"),
                          type = "start_site",
                          TxDb = region_list,nbin = 800)
Heatmap of over two regions

Heatmap of over two regions

We also meet the need of ploting heatmap and peak profiling together.

peak_Profile_Heatmap(peak = files[[4]],
                     upstream = 1000,
                     downstream = 1000,
                     by = "gene",
                     type = "start_site",
                     TxDb = txdb,
                     nbin = 800)
Combination of heatmap and peak profiling

Combination of heatmap and peak profiling

Exploring several regions with heatmap and peak profiling is also supported.

txdb1 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb2 <- unlist(fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))[1:10000,]

region_list <- list(geneX = txdb1, geneY = txdb2)
peak_Profile_Heatmap(peak = files[[4]],
                     upstream = 1000,
                     downstream = 1000,
                     by = c("geneX","geneY"),
                     type = "start_site",
                     TxDb = region_list,nbin = 800)
Combination of heatmap and peak profiling over several regions

Combination of heatmap and peak profiling over several regions

Average Profile of ChIP peaks binding to TSS region

plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
            xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
## >> plotting figure...             2023-10-24 16:43:07
Average Profile of ChIP peaks binding to TSS region

Average Profile of ChIP peaks binding to TSS region

The function plotAvgProf2 provide a one step from bed file to average profile plot. The following command will generate the same figure as shown above.

plotAvgProf2(files[[4]], TxDb=txdb, upstream=3000, downstream=3000,
             xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Confidence interval estimated by bootstrap method is also supported for characterizing ChIP binding profiles.

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 1000)

Profile of ChIP peaks binding to different regions

Referring to the issue #16 , we developed and improved several functions support start site region, end site region and body region of Gene/Transcript/Exon/Intron/3UTR/5UTR. getBioRegion can prepare the different regions for ChIP peaks to bind. getTagMatrix can accept type, by, upstream and downstream parameters to get tagmatrix according to different needs. plotPeakProf and plotPeakProf2 supports the plotting of profiles of peaks binding to different regions.Users can also create heatmap or average profile of ChIP peaks binding to these regions.

In order to plot body regions, a new methond binning,was introduced to getTagMatrix. The idea of binning was derived from deeptools(Ramı́rez et al. 2016). binning scaled the regions having different lengths to the equal length by deviding the regions into the same amounts of boxes. Because the amount of boxes is equal, the regions can be thought of scaling to equal length.binning method can speed up the getTagMatrix by changing the precision from bp to box(several bps).

There are three ways to plot these regions. First, users can use getBioRegion to prepare the regions. Then align the peaks that are mapping to these regions, and generate the tagMatrix by getTagMatrix. At Last, plot the figures by plotPeakProf. Second, users can input type and by parameters to getTagMatrix to get the tagMatrix and plot the figures. Third, users can use plotPeakProf2 to do everything in one step.

Binning method for profile of ChIP peaks binding to TSS regions

Here uses the method of inputting type and by parameters. type = "start_site" means the start site region. by = "gene" means that it is the start site region of gene(TSS regions). If users want to use binning method, the nbin method must be set.

## The results of binning method and normal method are nearly the same. 
tagMatrix_binning <- getTagMatrix(peak = peak, TxDb = txdb, 
                                  upstream = 3000, downstream = 3000, 
                                  type = "start_site", by = "gene", 
                                  weightCol = "V5", nbin = 800)

Profile of ChIP peaks binding to body regions

We improved and developed several functions to plot body region of Gene/Transcript/Exon/Intron/3UTR/5UTR. If users want to get more information from the body region, we added upstream and downstream parameters to functions in order to get flank extension of body regions. upstream and downstream can be NULL(default), rel object and actual numbers. NULL(default) reflects body regions with no flank extension. Rel object reflects the percentage of total length of body regions. Actual numbers reflects the actual length of flank extension.

## Here uses `plotPeakProf2` to do all things in one step.
## Gene body regions having lengths smaller than nbin will be filtered
## A message will be given to warning users about that.
## >> 9 peaks(0.872093%), having lengths smaller than 800bp, are filtered...

## the ignore_strand is FALSE in default. We put here to emphasize that.
## We will not show it again in the below example
plotPeakProf2(peak = peak, upstream = rel(0.2), downstream = rel(0.2),
              conf = 0.95, by = "gene", type = "body", nbin = 800,
              TxDb = txdb, weightCol = "V5",ignore_strand = F)

Users can also get the profile ChIP peaks binding to gene body regions with no flank extension or flank extension decided by actual length.

## The first method using getBioRegion(), getTagMatrix() and plotPeakProf() to plot in three steps.
genebody <- getBioRegion(TxDb = txdb,
                         by = "gene",
                         type = "body")

matrix_no_flankextension <- getTagMatrix(peak,windows = genebody, nbin = 800)

plotPeakProf(matrix_no_flankextension,conf = 0.95)

## The second method of using getTagMatrix() and plotPeakProf() to plot in two steps
matrix_actual_extension <- getTagMatrix(peak,windows = genebody, nbin = 800,
                                        upstream = 1000,downstream = 1000)
plotPeakProf(matrix_actual_extension,conf = 0.95)

Users can also get the body region of 5UTR/3UTR.

five_UTR_body <- getTagMatrix(peak = peak, 
                              TxDb = txdb,
                              upstream = rel(0.2),
                              downstream = rel(0.2), 
                              type = "body",
                              by = "5UTR",
                              weightCol = "V5",
                              nbin = 50)

plotPeakProf(tagMatrix = five_UTR_body, conf = 0.95)

Profile of ChIP peaks binding to TTS regions

TTS_matrix <- getTagMatrix(peak = peak, 
                           TxDb = txdb,
                           upstream = 3000,
                           downstream = 3000, 
                           type = "end_site",
                           by = "gene",
                           weightCol = "V5")

plotPeakProf(tagMatrix = TTS_matrix, conf = 0.95)

Peak Annotation

peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")
## >> loading peak file...               2023-10-24 16:43:08 
## >> preparing features information...      2023-10-24 16:43:08 
## >> identifying nearest features...        2023-10-24 16:43:08 
## >> calculating distance from peak to TSS...   2023-10-24 16:43:08 
## >> assigning genomic annotation...        2023-10-24 16:43:08 
## >> adding gene annotation...          2023-10-24 16:43:22 
## >> assigning chromosome lengths           2023-10-24 16:43:22 
## >> done...                    2023-10-24 16:43:22

Note that it would also be possible to use Ensembl-based EnsDb annotation databases created by the ensembldb package for the peak annotations by providing it with the TxDb parameter. Since UCSC-style chromosome names are used we have to change the style of the chromosome names from Ensembl to UCSC in the example below.

library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
seqlevelsStyle(edb) <- "UCSC"

peakAnno.edb <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                             TxDb=edb, annoDb="org.Hs.eg.db")

Peak Annotation is performed by annotatePeak. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. The output of annotatePeak is csAnno instance. ChIPseeker provides as.GRanges to convert csAnno to GRanges instance, and as.data.frame to convert csAnno to data.frame which can be exported to file by write.table.

TxDb object contained transcript-related features of a particular genome. Bioconductor provides several package that containing TxDb object of model organisms with multiple commonly used genome version, for instance TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc. User can also prepare their own TxDb object by retrieving information from UCSC Genome Bioinformatics and BioMart data resources by R function makeTxDbFromBiomart and makeTxDbFromUCSC. TxDb object should be passed for peak annotation.

All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.

Downstream is defined as the downstream of gene end.

ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.

annotatePeak report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80)”, means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.

Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.

Visualize Genomic Annotation

To annotate the location of a given peak in terms of genomic features, annotatePeak assigns peaks to genomic annotation in “annotation” column of the output, which includes whether a peak is in the TSS, Exon, 5’ UTR, 3’ UTR, Intronic or Intergenic. Many researchers are very interesting in these annotations. TSS region can be defined by user and annotatePeak output in details of which exon/intron of which genes as illustrated in previous section.

Pie and Bar plot are supported to visualize the genomic annotation.

plotAnnoPie(peakAnno)
Genomic Annotation by pieplot

Genomic Annotation by pieplot

plotAnnoBar(peakAnno)
Genomic Annotation by barplot

Genomic Annotation by barplot

Since some annotation overlap, user may interested to view the full annotation with their overlap, which can be partially resolved by vennpie function.

vennpie(peakAnno)
Genomic Annotation by vennpie

Genomic Annotation by vennpie

We extend UpSetR to view full annotation overlap. User can user upsetplot function.

upsetplot(peakAnno)

We can combine vennpie with upsetplot by setting vennpie = TRUE.

upsetplot(peakAnno, vennpie=TRUE)

Visualize distribution of TF-binding loci relative to TSS

The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. We provide plotDistToTSS to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci\nrelative to TSS")
Distribution of Binding Sites

Distribution of Binding Sites

Functional enrichment analysis

Once we have obtained the annotated nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO)(Ashburner et al. 2000) annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG)(Kanehisa et al. 2004) annotates genes to pathways, Disease Ontology (DO)(Schriml et al. 2011) annotates genes with human disease association, and Reactome(Croft et al. 2013) annotates gene to pathways and reactions.

ChIPseeker also provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.

Enrichment analysis is a widely used approach to identify biological themes. I have developed several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE(Yu et al. 2015) for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler(Yu et al. 2012) for Gene Ontology and KEGG enrichment analysis.

library(ReactomePA)

pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)
##                          ID                         Description GeneRatio
## R-HSA-9758941 R-HSA-9758941                        Gastrulation    23/487
## R-HSA-186712   R-HSA-186712 Regulation of beta-cell development    12/487
##                 BgRatio       pvalue     p.adjust       qvalue
## R-HSA-9758941 114/11009 7.284830e-10 7.452381e-07 7.452381e-07
## R-HSA-186712   41/11009 1.199636e-07 6.136140e-05 6.136140e-05
##                                                                                                                               geneID
## R-HSA-9758941 5453/5692/5076/5080/7546/3169/652/5015/2303/5717/3975/2627/5714/344022/7849/2637/8320/6657/51176/2296/28514/2626/64321
## R-HSA-186712                                                       2494/5080/3651/3175/6928/390874/3642/4821/4825/2255/222546/389692
##               Count
## R-HSA-9758941    23
## R-HSA-186712     12
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
##                          ID                         Description GeneRatio
## R-HSA-9758941 R-HSA-9758941                        Gastrulation    22/397
## R-HSA-186712   R-HSA-186712 Regulation of beta-cell development    10/397
##                 BgRatio       pvalue     p.adjust       qvalue
## R-HSA-9758941 114/11009 8.696224e-11 8.287501e-08 8.046295e-08
## R-HSA-186712   41/11009 1.367484e-06 6.516060e-04 6.326411e-04
##                                                                                                                          geneID
## R-HSA-9758941 5453/5692/5076/7546/3169/652/2303/5717/3975/2627/5714/344022/2637/8320/2626/5080/5015/7849/51176/2296/64321/28514
## R-HSA-186712                                                              2494/3651/4821/4825/2255/222546/389692/5080/6928/3642
##               Count
## R-HSA-9758941    22
## R-HSA-186712     10
dotplot(pathway2)

More information can be found in the vignettes of Bioconductor packages DOSE(Yu et al. 2015), ReactomePA, clusterProfiler(Yu et al. 2012), which also provide several methods to visualize enrichment results. The clusterProfiler(Yu et al. 2012) is designed for comparing and visualizing functional profiles among gene clusters, and can directly applied to compare biological themes at GO, DO, KEGG, Reactome perspective.

ChIP peak data set comparison

Profile of several ChIP peak data binding to TSS region

Function plotAvgProf, tagHeatmap and plotPeakProf can accept a list of tagMatrix and visualize profile or heatmap among several ChIP experiments, while plotAvgProf2 , peakHeatmap and plotPeakProf2 can accept a list of bed files and perform the same task in one step.

Average profiles

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
## >> plotting figure...             2023-10-24 16:44:31
Average Profiles of ChIP peaks among different experiments

Average Profiles of ChIP peaks among different experiments

plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")

## normal method
plotPeakProf2(files, upstream = 3000, downstream = 3000, conf = 0.95,
              by = "gene", type = "start_site", TxDb = txdb,
              facet = "row")

## binning method 
plotPeakProf2(files, upstream = 3000, downstream = 3000, conf = 0.95,
              by = "gene", type = "start_site", TxDb = txdb,
              facet = "row", nbin = 800)

Peak heatmaps

tagHeatmap(tagMatrixList)
Heatmap of ChIP peaks among different experiments

Heatmap of ChIP peaks among different experiments

Profile of several ChIP peak data binding to body region

Functions plotPeakProf and plotPeakProf2 also support to plot profile of several ChIP peak data binding to body region.

plotPeakProf2(files, upstream = rel(0.2), downstream = rel(0.2),
              conf = 0.95, by = "gene", type = "body",
              TxDb = txdb, facet = "row", nbin = 800)