Contents

1 Introduction

This document describes how to use chipenrich to analyze results from ChIP-Seq experiments and other DNA sequencing experiments that result in a set of genomic regions. chipenrich includes two methods that adjust for potential confounders of gene set enrichment testing (locus length and mappability of the sequence reads). The first method chipenrich is designed for use with transcription-factor based ChIP-seq experiments and other DNA sequencing experiments with narrow genomic regions. The second method broadenrich is designed for use with histone modification based ChIP-seq experiments and other DNA sequencing experiments with broad genomic regions.

2 Synopsis

After starting R, the package should be loaded using the following:

library(chipenrich)

This will load chipenrich, the chipenrich.data package, and necessary dependencies. The main function for conducting all gene set enrichment testing is chipenrich(), whose defaults are:

chipenrich(peaks, out_name = "chipenrich", out_path = getwd(),
  genome = "hg19", genesets = c("GOBP", "GOCC", "GOMF"),
  locusdef = "nearest_tss", method = "chipenrich",
  fisher_alt = "two.sided", use_mappability = F, mappa_file = NULL,
  read_length = 36, qc_plots = T, max_geneset_size = 2000,
  num_peak_threshold = 1, n_cores = 1)

The peaks option should be either a data frame or character vector representing the path to a file containing the peaks. The file (or data frame) should have at least 3 columns: chrom, start and end, denoting the chromosome, start position, and end position of the peaks. Chromosome should be in UCSC format, e.g. chrX, chrY, chr22, etc. If a file, it must be tab-delimited, and the header must exist. The input file may also be a .bed, .broadPeak, or .narrowPeak file. Additional columns can exist, so long as they do not contain tab characters. Two example datasets, peaks_E2F4 and peaks_H3K4me3_GM12878, are included in the package.

data(peaks_E2F4, package = 'chipenrich.data')
data(peaks_H3K4me3_GM12878, package = 'chipenrich.data')

head(peaks_E2F4)
##   chrom     start       end
## 1  chr1 156186314 156186469
## 2  chr1  10490456  10490550
## 3  chr1  46713352  46713436
## 4  chr1 226496843 226496924
## 5  chr1 200589825 200589928
## 6  chr1  47779789  47779907

The first task of chipenrich() is to assign the peaks to genes. Currently supported genomes are listed below, with supported_genomes(). Data from older genome versions may be converted using UCSC’s liftover tool: http://genome.ucsc.edu/cgi-bin/hgLiftOver.

supported_genomes()
## [1] "dm3"  "hg19" "mm10" "mm9"  "rn4"

Peaks are assigned to genes according to a pre-defined locus definition, i.e. the region where peaks have to occur in order to be assigned to a gene. The following locus definitions are supported in chipenrich:

supported_locusdefs()
##    genome               locusdef
## 1     dm3                   10kb
## 2     dm3 10kb_and_more_upstream
## 3     dm3                    1kb
## 4     dm3                    5kb
## 5     dm3                   exon
## 6     dm3                 intron
## 7     dm3           nearest_gene
## 8     dm3            nearest_tss
## 9    hg19                   10kb
## 10   hg19 10kb_and_more_upstream
## 11   hg19                    1kb
## 12   hg19                    5kb
## 13   hg19                   exon
## 14   hg19                 intron
## 15   hg19           nearest_gene
## 16   hg19            nearest_tss
## 17   mm10                   10kb
## 18   mm10 10kb_and_more_upstream
## 19   mm10                    1kb
## 20   mm10                    5kb
## 21   mm10                   exon
## 22   mm10                 intron
## 23   mm10           nearest_gene
## 24   mm10            nearest_tss
## 25    mm9                   10kb
## 26    mm9 10kb_and_more_upstream
## 27    mm9                    1kb
## 28    mm9                    5kb
## 29    mm9                   exon
## 30    mm9                 intron
## 31    mm9           nearest_gene
## 32    mm9            nearest_tss
## 33    rn4                   10kb
## 34    rn4 10kb_and_more_upstream
## 35    rn4                    1kb
## 36    rn4                    5kb
## 37    rn4                   exon
## 38    rn4                 intron
## 39    rn4           nearest_gene
## 40    rn4            nearest_tss

Using the options 1kb, 5kb, or 10kb will only assign peaks to genes if the peaks are within 1 kilobases (kb), 5kb, or 10kb of a gene’s transcription start site (TSS), respectively. The option exon or intron will assign peaks to genes if the peaks occur within a gene’s exons or introns, respectively. The option 10kb_and_more_upstream will assign peaks to genes if the peaks occur in a region more than 10kb upstream from a TSS to the midpoint between the adjacent TSS. Using nearest_gene or nearest_tss will assign peaks to genes according to the nearest gene or the nearest TSS. Only the nearest_gene and nearest_tss locus definitions retain all peaks, others use only a subset of peaks that fall within the defined region. All gene loci are non-overlapping. The command help(chipenrich.data) may also provide more information on the locus definitions. Users may also create their own custom locus definitions.

The default gene set database is Gene Ontology (GO) terms, comprising of all three GO branches (GOBP, GOCC, and GOMF). Though, many more genesets are supported by chipenrich:

supported_genesets()
##                     geneset organism
## 1                      GOBP      dme
## 5                      GOCC      dme
## 9                      GOMF      dme
## 2                      GOBP      hsa
## 6                      GOCC      hsa
## 10                     GOMF      hsa
## 13         biocarta_pathway      hsa
## 16                 cytoband      hsa
## 17                drug_bank      hsa
## 20        ehmn_pathway_gene      hsa
## 23          gene_expression      hsa
## 25             kegg_pathway      hsa
## 28                     mesh      hsa
## 31               metabolite      hsa
## 34                  mirbase      hsa
## 37          panther_pathway      hsa
## 40                     pfam      hsa
## 43 protein_interaction_mimi      hsa
## 46                 reactome      hsa
## 49    transcription_factors      hsa
## 3                      GOBP      mmu
## 7                      GOCC      mmu
## 11                     GOMF      mmu
## 14         biocarta_pathway      mmu
## 18                drug_bank      mmu
## 21        ehmn_pathway_gene      mmu
## 24          gene_expression      mmu
## 26             kegg_pathway      mmu
## 29                     mesh      mmu
## 32               metabolite      mmu
## 35                  mirbase      mmu
## 38          panther_pathway      mmu
## 41                     pfam      mmu
## 44 protein_interaction_mimi      mmu
## 47                 reactome      mmu
## 50    transcription_factors      mmu
## 4                      GOBP      rno
## 8                      GOCC      rno
## 12                     GOMF      rno
## 15         biocarta_pathway      rno
## 19                drug_bank      rno
## 22        ehmn_pathway_gene      rno
## 27             kegg_pathway      rno
## 30                     mesh      rno
## 33               metabolite      rno
## 36                  mirbase      rno
## 39          panther_pathway      rno
## 42                     pfam      rno
## 45 protein_interaction_mimi      rno
## 48                 reactome      rno
## 51    transcription_factors      rno

Three methods for gene set enrichment testing are provided: the main ChIP-Enrich method (chipenrich), the Broad-Enrich method (broadenrich), and Fisher’s exact test (fet). The chipenrich method designed for datasets with narrow genomic regions such as transcription factor ChIP-seq peaks. The broadenrich method is designed for datasets with broad genomic regions such as histone modification ChIP-seq peaks. Finally, the fet method is also offered for comparison purposes and/or for use in limited situations when its assumptions are met (see examples).

supported_methods()
## [1] "chipenrich"  "fet"         "broadenrich"

Accounting for mappability of reads is optional and can only be accomplished using the ChIP-Enrich or Broad-Enrich method. See the section on mappability for more information on how it is calculated. Mappabilities for the following read lengths are available (24bp is only available for hg19):

supported_read_lengths()
##    genome               locusdef read_length
## 1    hg19                   10kb         100
## 2    hg19                   10kb          24
## 3    hg19                   10kb          36
## 4    hg19                   10kb          40
## 5    hg19                   10kb          50
## 6    hg19                   10kb          75
## 7    hg19 10kb_and_more_upstream         100
## 8    hg19 10kb_and_more_upstream          24
## 9    hg19 10kb_and_more_upstream          36
## 10   hg19 10kb_and_more_upstream          40
## 11   hg19 10kb_and_more_upstream          50
## 12   hg19 10kb_and_more_upstream          75
## 13   hg19                    1kb         100
## 14   hg19                    1kb          24
## 15   hg19                    1kb          36
## 16   hg19                    1kb          40
## 17   hg19                    1kb          50
## 18   hg19                    1kb          75
## 19   hg19                    5kb         100
## 20   hg19                    5kb          24
## 21   hg19                    5kb          36
## 22   hg19                    5kb          40
## 23   hg19                    5kb          50
## 24   hg19                    5kb          75
## 25   hg19                   exon         100
## 26   hg19                   exon          24
## 27   hg19                   exon          36
## 28   hg19                   exon          40
## 29   hg19                   exon          50
## 30   hg19                   exon          75
## 31   hg19                 intron         100
## 32   hg19                 intron          24
## 33   hg19                 intron          36
## 34   hg19                 intron          40
## 35   hg19                 intron          50
## 36   hg19                 intron          75
## 37   hg19           nearest_gene         100
## 38   hg19           nearest_gene          24
## 39   hg19           nearest_gene          36
## 40   hg19           nearest_gene          40
## 41   hg19           nearest_gene          50
## 42   hg19           nearest_gene          75
## 43   hg19            nearest_tss         100
## 44   hg19            nearest_tss          24
## 45   hg19            nearest_tss          36
## 46   hg19            nearest_tss          40
## 47   hg19            nearest_tss          50
## 48   hg19            nearest_tss          75
## 49    mm9                   10kb         100
## 50    mm9                   10kb          36
## 51    mm9                   10kb          40
## 52    mm9                   10kb          50
## 53    mm9                   10kb          75
## 54    mm9 10kb_and_more_upstream         100
## 55    mm9 10kb_and_more_upstream          36
## 56    mm9 10kb_and_more_upstream          40
## 57    mm9 10kb_and_more_upstream          50
## 58    mm9 10kb_and_more_upstream          75
## 59    mm9                    1kb         100
## 60    mm9                    1kb          36
## 61    mm9                    1kb          40
## 62    mm9                    1kb          50
## 63    mm9                    1kb          75
## 64    mm9                    5kb         100
## 65    mm9                    5kb          36
## 66    mm9                    5kb          40
## 67    mm9                    5kb          50
## 68    mm9                    5kb          75
## 69    mm9                   exon         100
## 70    mm9                   exon          36
## 71    mm9                   exon          40
## 72    mm9                   exon          50
## 73    mm9                   exon          75
## 74    mm9                 intron         100
## 75    mm9                 intron          36
## 76    mm9                 intron          40
## 77    mm9                 intron          50
## 78    mm9                 intron          75
## 79    mm9           nearest_gene         100
## 80    mm9           nearest_gene          36
## 81    mm9           nearest_gene          40
## 82    mm9           nearest_gene          50
## 83    mm9           nearest_gene          75
## 84    mm9            nearest_tss         100
## 85    mm9            nearest_tss          36
## 86    mm9            nearest_tss          40
## 87    mm9            nearest_tss          50
## 88    mm9            nearest_tss          75

3 Locus Definitions

We define a gene locus as the region from which we predict a gene could be regulated. ChIP-seq peaks, or other types of genomic regions, falling within a locus for a gene are assigned to that gene.

We provide a number of different definitions of a gene locus:

4 Mappability

We define base pair mappability as the average read mappability of all possible reads of size K that encompass a specific base pair location, \(b\). Mappability files from UCSC Genome Browser mappability track were used to calculate base pair mappability. The mappability track provides values for theoretical read mappability, or the number of places in the genome that could be mapped by a read that begins with the base pair location \(b\). For example, a value of 1 indicates a Kmer read beginning at \(b\) is mappable to one area in the genome. A value of 0.5 indicates a Kmer read beginning at \(b\) is mappable to two areas in the genome. For our purposes, we are only interested in uniquely mappable reads; therefore, all reads with mappability less than 1 were set to 0 to indicate non-unique mappability. Then, base pair mappability is calculated as:

\[ \begin{equation} M_{i} = (\frac{1}{2K-1}) \sum_{j=i-K+1}^{i+(K-1)} M_{j} \end{equation} \]

where \(M_{i}\) is the mappability of base pair \(i\), and \(M_{j}\) is mappability (from UCSC’s mappability track) of read \(j\) where j is the start position of the K length read. We calculated base pair mappability for reads of lengths 24, 36, 40, 50, 75, and 100 base pairs for Homo sapiens (build hg19) and for reads of lengths 36, 40, 50, 75, and 100 base pairs for Mus musculus (build mm9). Mappability is unavailable for Rattus norvegicus (build rn4) and Mus musculus (build mm10).

4.1 Locus Mappability

We define locus mappability as the average of all base pair mappability values for a gene’s locus. Locus mappability is calculated for each available locus definition.

5 Examples

If method = chipenrich and qc_plots = T then two pdf files will be output: One with a binomial smoothing spline fitted to the probability of a peak given gene length and one showing the distribution of the distance of peaks to the nearest TSS of a gene. These plots may also be generated using separate functions as illustrated below. The first figure below shows the distribution of peaks to the nearest TSS. In the second figure below, spline is fitted to the data given gene locus length. In the third figure below, we do the same but here we account for the mappable locus length (\(mappability \times locuslength\)).

plot_dist_to_tss(peaks = peaks_E2F4, genome = 'hg19')
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
E2F4 peak distances to TSS

E2F4 peak distances to TSS

plot_spline_length(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19')
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
E2F4 spline without mappability

E2F4 spline without mappability

plot_spline_length(peaks = peaks_E2F4, locusdef = 'nearest_tss',  genome = 'hg19', use_mappability = T, read_length = 24)
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
E2F4 spline with mappability

E2F4 spline with mappability

If method = broadenrich and qc_plots = T then one pdf file is output: A plot showing the relationship between the gene locus length and the proportion of the locus covered by a peak. Figure 4 shows this relationship.

plot_gene_coverage(peaks = peaks_H3K4me3_GM12878, locusdef = 'nearest_tss',  genome = 'hg19')
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
H3K4me3 gene coverage

H3K4me3 gene coverage

5.1 ChIP-Enrich

There are many combinations of methods, genesets, and mappabiity settings that may be used to do gene set enrichment testing using chipenrich. In the following, we include some examples of gene set enrichment testing using the peaks_E2F4 and peaks_H3K4me3_GM12878 example datasets. Note: Analysis using multiple cores (n_cores) is not available on Windows.

# Without mappability
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = F, out_name = NULL, n_cores = 1)
## User-specified geneset(s)...
## Reading user-specified gene set definitions: /Users/rcavalca/Projects/chipenrich/inst/extdata/vignette_genesets.txt
## Done setting up user-specified geneset..
## Reading peaks from data.frame..
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
## Assigning peaks to genes with assign_peaks(...) ..
## Successfully assigned peaks..
## Test: ChIP-Enrich
## Genesets: user-supplied
## Running tests..
results.ce = results$results
print(results.ce[1:5,1:5])
##           Geneset.Type Geneset.ID Description   P.value       FDR
## goterm13 user-supplied GO:0031400  GO:0031400 0.1329270 0.7728310
## goterm3  user-supplied GO:0006631  GO:0006631 0.1557651 0.7728310
## goterm7  user-supplied GO:0009101  GO:0009101 0.2330154 0.7728310
## goterm8  user-supplied GO:0009314  GO:0009314 0.3834412 0.7728310
## goterm4  user-supplied GO:0007346  GO:0007346 0.4371945 0.7807044
# With mappability
results = chipenrich(peaks = peaks_E2F4, genesets = gs_path,
    locusdef = "nearest_tss", use_mappability=T, read_length=24, qc_plots = F,
    out_name = NULL,n_cores=1)
## User-specified geneset(s)...
## Reading user-specified gene set definitions: /Users/rcavalca/Projects/chipenrich/inst/extdata/vignette_genesets.txt
## Done setting up user-specified geneset..
## Reading peaks from data.frame..
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
## Assigning peaks to genes with assign_peaks(...) ..
## Successfully assigned peaks..
## Mappability adjustment is enabled..
## Test: ChIP-Enrich
## Genesets: user-supplied
## Running tests..
results.cem = results$results
print(results.cem[1:5,1:5])
##           Geneset.Type Geneset.ID Description   P.value       FDR
## goterm13 user-supplied GO:0031400  GO:0031400 0.1244346 0.6353849
## goterm3  user-supplied GO:0006631  GO:0006631 0.1270770 0.6353849
## goterm7  user-supplied GO:0009101  GO:0009101 0.2233518 0.6770617
## goterm8  user-supplied GO:0009314  GO:0009314 0.3600322 0.6770617
## goterm4  user-supplied GO:0007346  GO:0007346 0.3791546 0.6770617

5.2 Broad-Enrich

results = chipenrich(peaks = peaks_H3K4me3_GM12878, genesets = gs_path,
    method='broadenrich', locusdef = "nearest_tss", qc_plots = F,
    out_name = NULL, n_cores=1)
## User-specified geneset(s)...
## Reading user-specified gene set definitions: /Users/rcavalca/Projects/chipenrich/inst/extdata/vignette_genesets.txt
## Done setting up user-specified geneset..
## Reading peaks from data.frame..
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
## Assigning peaks to genes with assigned_peak_segments(...) ..
## Successfully assigned peaks..
## Calculating peak overlaps with gene loci..
## Test: Broad-Enrich
## Genesets: user-supplied
## Running tests..
results.be = results$results
print(results.be[1:5,1:5])
##          Geneset.Type Geneset.ID Description    P.value       FDR
## ratio1  user-supplied GO:0002521  GO:0002521 0.05227898 0.8126158
## ratio22 user-supplied GO:0051129  GO:0051129 0.29165616 0.8126158
## ratio14 user-supplied GO:0034220  GO:0034220 0.31491786 0.8126158
## ratio12 user-supplied GO:0022604  GO:0022604 0.35426035 0.8126158
## ratio15 user-supplied GO:0034660  GO:0034660 0.38877141 0.8126158

5.3 Fisher’s Exact Test

Fisher’s Exact test assumes that each gene is equally likely to have a peak. We recommend using Fisher’s exact test with the 1kb or 5kb locus definitions only. This will force all genes to have approximately the same locus length and avoid returning bias results.

results = chipenrich(peaks = peaks_E2F4, genesets = gs_path, locusdef = "5kb",
    method = "fet", fisher_alt = "two.sided", qc_plots = F, out_name = NULL)
## User-specified geneset(s)...
## Reading user-specified gene set definitions: /Users/rcavalca/Projects/chipenrich/inst/extdata/vignette_genesets.txt
## Done setting up user-specified geneset..
## Reading peaks from data.frame..
## Constructing GRanges
## Sorting GRanges
## Reducing GRanges
## Assigning peaks to genes with assign_peaks(...) ..
## Successfully assigned peaks..
## Test: Fisher's Exact Test
## Genesets: user-supplied
## Running tests..
results.fet = results$results
print(results.fet[1:5,1:5])
##     Geneset.Type Geneset.ID Description    P.value       FDR
## 14 user-supplied GO:0031400  GO:0031400 0.09062939 0.5664337
## 9  user-supplied GO:0009314  GO:0009314 0.16267803 0.5982504
## 5  user-supplied GO:0007346  GO:0007346 0.16751011 0.5982504
## 8  user-supplied GO:0009101  GO:0009101 0.35941216 0.8168458
## 4  user-supplied GO:0006631  GO:0006631 0.43484863 0.8906293

6 Output

The output of chipenrich() is an R object containing the results of the test and the peak to gene assignments. Both of these are also written to text files in the working directory (unless specified otherwised) after the test is completed.

6.1 Assigned peaks

Peak assignments are stored in $peaks. For example,

head(results$peaks)
##     peak_id  chr peak_start  peak_end geneid gene_symbol gene_locus_start
## 1  peak:587 chr1   76190246  76190344     34       ACADM         76185042
## 2  peak:758 chr1  112050490 112050617    140      ADORA3        112041743
## 3 peak:1261 chr1  226595642 226595837    142       PARP1        226590801
## 4 peak:1355 chr1  244614895 244615100    159        ADSS        244610413
## 5 peak:1356 chr1  244615437 244615512    159        ADSS        244610413
## 6  peak:702 chr1  100315629 100315789    178         AGL        100310639
##   gene_locus_end nearest_tss dist_to_tss nearest_tss_gene
## 1       76191321    76190042         252               34
## 2      112051743   112046743       -3809              140
## 3      226600801   226595801          61              142
## 4      244616544   244615436         438              159
## 5      244616544   244615436         -37              159
## 6      100315840   100315639          69              178
##   nearest_tss_gene_strand
## 1                       +
## 2                       -
## 3                       -
## 4                       -
## 5                       -
## 6                       +

6.2 Peaks-per-gene

Peak information aggregated over genes is stored in $peaks_per_gene. For example,

head(results$peaks_per_gene)
##      geneid length log10_length num_peaks peak
## 604     864  30135     4.479071         5    1
## 2447   3725  10000     4.000000         5    1
## 6333   9659  70457     4.847924         5    1
## 1753   2672  13075     4.116442         4    1
## 2254   3399  10343     4.014647         4    1
## 4971   7398  10407     4.017326         4    1

6.3 Gene set enrichment results

Gene set enrichment results are stored in $results. For example,

head(results$results)
##     Geneset.Type Geneset.ID Description    P.value       FDR Odds.Ratio
## 14 user-supplied GO:0031400  GO:0031400 0.09062939 0.5664337   1.585073
## 9  user-supplied GO:0009314  GO:0009314 0.16267803 0.5982504   1.446375
## 5  user-supplied GO:0007346  GO:0007346 0.16751011 0.5982504   1.475597
## 8  user-supplied GO:0009101  GO:0009101 0.35941216 0.8168458   1.279989
## 4  user-supplied GO:0006631  GO:0006631 0.43484863 0.8906293   1.244098
## 16 user-supplied GO:0034660  GO:0034660 0.87556302 1.0000000   1.048473
##      Status N.Geneset.Genes N.Geneset.Peak.Genes
## 14 enriched             297                   18
## 9  enriched             284                   16
## 5  enriched             299                   17
## 8  enriched             295                   15
## 4  enriched             282                   14
## 16 enriched             281                   12
##                                                                                                  Geneset.Peak.Genes
## 14 3725, 1647, 2810, 7161, 991, 1855, 2629, 5590, 5664, 5686, 5690, 6497, 8613, 10459, 54206, 64853, 116496, 374969
## 9              3725, 7398, 777, 1647, 7161, 10277, 1263, 4893, 7159, 7391, 8438, 23596, 25896, 51150, 51514, 261734
## 5          1104, 7161, 63967, 64326, 991, 1063, 1263, 4582, 4751, 5586, 5686, 5690, 6886, 7175, 10459, 54998, 79577
## 8                   10905, 6487, 142, 1650, 4582, 6646, 6675, 8704, 9653, 22796, 23169, 29929, 55624, 79947, 148789
## 4                             9261, 34, 1376, 2475, 5565, 5743, 5825, 6051, 8560, 11000, 11332, 51094, 55268, 64834
## 16                               5394, 6202, 7175, 23318, 25896, 25973, 55157, 55699, 79707, 113802, 116461, 126789

7 References

R.P. Welch^, C. Lee^, R.A. Smith, P. Imbriano, S. Patil, T. Weymouth, L.J. Scott, M.A. Sartor. “ChIP-Enrich: gene set enrichment testing for ChIP-seq data.” Nucl. Acids Res. (2014) 42(13):e105. doi:10.1093/nar/gku463

R.G. Cavalcante, C. Lee, R.P. Welch, S. Patil, T. Weymouth, L.J. Scott, and M.A. Sartor. “Broad-Enrich: functional interpretation of large sets of broad genomic regions.” Bioinformatics (2014) 30(17):i393-i400 doi:10.1093/bioinformatics/btu444