# Introduction

The identification of the transcription factor (TF) responsible for the coregulation of an specific set of genes is a common problem in transcriptomics. In the most simple scenario, the comparison of the transcriptome of cells or organisms in two conditions leads to the identification of a set of differentially expressed (DE) genes and the underlying assumption is that one or a few TFs regulate the expression of those genes. Traditionally, the identification of the relevant TFs has relied on the use of position weight matrices (PWMs) to predict transcription factor binding sites (TFBSs) proximal to the DE genes (Wasserman and Sandelin, 2004). The comparison of predicted TFBS in DE versus control genes reveals factors that are significantly enriched in the DE gene set. These approaches have been useful to narrow down potential binding sites, but can suffer from high rates of false positives. In addition, this strategy is limited by design to sequence-specific transcription factors (TF) and thus unable to identify cofactors that bind indirectly to target genes. To overcome these limitations, TFEA.ChIP exploits the vast amount of publicly available ChIP-seq datasets to determine TFBS proximal to a given set of genes and computes enrichment analysis based on this experimentally-derived rich information. Specifically TFEA.ChIP, uses information derived from the hundreds of ChIP-seq experiments from the ENCODE Consortium 1 expanded to include additional datasets contributed to GEO database2 3 by individual laboratories representing the binding sites of factors not assayed by ENCODE. The package includes a set of tools to customize the ChIP data, perform enrichment analysis and visualize the results. The package implements two enrichment analysis methods:

• Analysis of the association of TFBS and differential expression from 2x2 tables recording the presence of binding sites for a given TF in DE and control genes. The statistical significance of the association for each factor determined by a Fisher’s exact test.
• GSEA analysis, based on the core function of the GSEA algorithm developed by the GSEA team at the Broad Institute of MIT and Harvard4 5

TFEA.ChIP includes a TF-gene interaction database containing 1060 datasets from ChIP-Seq experiments testing 277 different human transcription factors from the ReMap 2018 repository6. Due to space limitations, TFEA.ChIPs internal database only includes ChIP-seq experiments from the ENCODE project. All the plots included in this vignette have been generated using the full ReMap 2018 ChIP-seq collection. To download the full ReMap 2018 database, as well as other ready-to-use databases, visit https://github.com/LauraPS1/TFEA.ChIP_downloads .

Although the package is mainly focused towards analyzing expression data generated from human cells, TFEA.ChIP includes the option to use datasets coming from experiments in mice, translating mouse gene names to their equivalent ID on the human genome.

# Analysis Example

TFEA.ChIP is designed to take the output of a differential expression analysis and identify TFBS enriched in the list of DE genes. In the case of the analysis of association, the only required input is a set of DE genes and, optionally, a set of control genes whose expression is not altered by the experimental conditions under study. For the GSEA analysis a ranked list of genes is required. This is supplied as a dataframe containing a column with gene names and a numerical column with the ranking metric, which typically are log-fold change or p-values for the gene expression changes in the two conditions under evaluation. For illustration purposes we will derive the input required for both analysis from a table containing the following field columns:

• Gene name (Genes). Internally the package uses Entrez IDs, but translating from Gene Symbols and ENSEMBL IDs is available.
• Log2 Fold Change (Log2FoldChange), indicating the difference in expression for each gene in the two experimental conditions being compared.
• p-value (pvalue) or adjusted p-value (pval.adj) for the difference in gene expression between the two conditions.

The output of popular packages, such as DESeq2, for detection of differentially expressed genes from the analysis of count data from RNA-seq experiments produce tables with this information. The hypoxia_DESeq and hypoxia datasets are the output of a differential expression analysis performed on an RNAseq experiment analyzing the response to hypoxia of endothelial cells7 deposited at the NCBI’s GEO repository (GSE89831).

To extract the information from a DESeqResults object or a data frame the function preprocessInputData is available.

Using the option from.Mouse = TRUE will convert mouse gene IDs to their equivalent human gene ID, thus taking advantage of the wider abailability of ChIP-seq experiments done on human cells. This strategy relies on the overlap between human and mouse transcription regulatory mechanisms. Nevertheless, we advise to be cautious using this approach, since extrapolating results from one organism to another is not always appropiate.

library(TFEA.ChIP)
data( "hypoxia_DESeq", "hypoxia", package="TFEA.ChIP" ) # load example datasets
hypoxia_table <- preprocessInputData( hypoxia_DESeq )
## Warning: Some genes returned 1:many mapping to ENTREZ ID. Genes were assigned the first ENTREZ ID match found.
head( hypoxia_table )
##   Genes log2FoldChange        pvalue      pval.adj
## 1  1769       4.588064  3.942129e-62  3.285202e-59
## 2  6781       4.339768  7.840000e-11  2.714222e-09
## 3 54843       4.307058  1.435621e-75  1.522672e-72
## 4  3486       4.297902  6.595438e-18  5.535897e-16
## 5   205       4.235590 1.318311e-125 5.126912e-122
## 6  5740       3.931662  4.398104e-15  2.631420e-13
head( hypoxia )
##      Genes log2FoldChange      pvalue   pval.adj
## 1     TNMD     0.00000000 1.000000000 1.00000000
## 2     DPM1    -0.44497788 0.001243905 0.01809033
## 3    SCYL3    -0.27092276 0.139977254 0.65057456
## 4 C1orf112    -0.52382569 0.035752231 0.25407939
## 5      FGR    -0.44810165 0.217220835 0.83961811
## 6      CFH     0.05237749 0.740720152 1.00000000
hypoxia_table <- preprocessInputData( hypoxia )
## Warning: Some genes returned 1:many mapping to ENTREZ ID. Genes were assigned the first ENTREZ ID match found.
head( hypoxia_table )
##        Genes log2FoldChange       pvalue     pval.adj
## 10785   1365       5.319709 9.647797e-77 2.817857e-73
## 10138 284525       5.179949 2.411485e-64 5.414910e-61
## 9362  392255       4.401819 3.559414e-91 1.998137e-87
## 3958    4883       4.208881 1.414157e-39 1.082537e-36
## 10320   2199       4.060889 1.305316e-46 1.465522e-43
## 5256    1769       4.024703 8.418844e-65 2.025454e-61

After running preprocessInputData, your dataset will be ready to use with the rest of the package; gene names will be in Entrez Gene ID format and the resulting table is sorted by log2(Fold Change).

## Analysis of the association of TFBS and differential expression.

1. Identification of DE genes

As indicated before, for this analysis, we must provide a list of genes are considered differentially induced and a list of control genes whose expression is not altered in the analyzed experiment. For that we will use the function Select_genes:

#extract vector with names of upregulated genes
Genes.Upreg <- Select_genes( hypoxia_table, min_LFC = 1 )
#extract vector with names of non-responsive genes
Genes.Control <- Select_genes( hypoxia_table,
min_pval = 0.5, max_pval = 1,
min_LFC = -0.25, max_LFC = 0.25 )
1. Translate the gene IDs to Entrez Gene IDs

In case the input dataset cannot be preprocessed or the user is interested in analyzing a particular set of genes that doesn’t come from the input dataset, translating the IDs to Entrez Gene ID format is required. To that end its available the function GeneID2entrez:

#Conversion of hgnc to ENTREZ IDs
GeneID2entrez( gene.IDs = c("EGLN3","NFYA","ALS2","MYC","ARNT" ) )
## [1] "112399" "4800"   "57679"  "4609"   "405"
# To translate from mouse IDs:
# GeneID2entrez( gene.IDs = c( "Hmmr", "Tlx3", "Cpeb4" ), mode = "m2h" ) # To get the equivalent human gene IDs
# GeneID2entrez( gene.IDs = c( "Hmmr", "Tlx3", "Cpeb4" ), mode = "m2m" ) # To get mouse ENTREZ gene IDs
1. Association analysis

In this step, we will construct a contingency table for each of the factors stored in the internal database categorizing the DE (DE_yes) and control (DE_no) genes according to the presence or absence of binding sites:

TFbound_yes TFbound_no
DE_yes number y/y number y/n
DE_no number n/y number n/n

Then, we will apply Fisher’s exact test to each contingency table to test the null hypothesis that factor binding and differential expression are independent. In addition, to the raw p-values the function also return the FDR-adjusted values to correct for multiple testing.

CM_list_UP <- contingency_matrix( Genes.Upreg, Genes.Control ) #generates list of contingency tables, one per dataset
pval_mat_UP <- getCMstats( CM_list_UP ) #generates list of p-values and OR from association test
head( pval_mat_UP )
##                                        Accession    Cell Treatment    TF
## 741 ENCSR000EVW.GATA2.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC           GATA2
## 740   ENCSR000EVU.FOS.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC             FOS
## 378                       ENCSR000BVG.RXRA.SKNSH   SKNSH            RXRA
## 648   ENCSR000EFA.JUN.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC             JUN
## 600                    ENCSR000ECV.EP300.HELA_S3 HELA S3           EP300
## 741 6.326303e-16 3.087546 1.626461 6.705881e-13      12.173544 12.351236
## 740 7.022587e-11 3.078663 1.622304 1.063420e-08       7.973295  8.239798
## 363 1.693487e-10 2.272543 1.184308 2.243870e-08       7.649002  7.754135
## 378 4.182003e-10 2.165727 1.114852 4.029930e-08       7.394702  7.486023
## 648 6.381965e-10 2.452037 1.293980 5.637402e-08       7.248921  7.392920
## 600 9.517216e-10 2.235195 1.160400 7.419132e-08       7.129647  7.235853

In this example, all 1122 datasets in the internal database were used in the analysis. However, we can restrict the analysis to a specific subset of the database and/or a given set of transcription factors. To this end we can produce and index of the tables of interest with the function get_chip_index and pass this index as an additional argument to contingency_matrix. Finally, note that the list of control genes is optional. If not supplied, all human genes not present in the test list will be used as control. Thus, we could restrict the analysis to the datasets generated by the ENCODE project and use all non-DE genes as control:

chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) #restrict the analysis to datasets assaying these factors
chip_index <- get_chip_index( encodeFilter = TRUE ) # Or select ENCODE datasets only
CM_list_UPe <- contingency_matrix( Genes.Upreg, Genes.Control, chip_index ) #generates list of contingency tables
pval_mat_UPe <- getCMstats( CM_list_UPe, chip_index ) #generates list of p-values and ORs
head( pval_mat_UPe )
##                                        Accession    Cell Treatment    TF
## 741 ENCSR000EVW.GATA2.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC           GATA2
## 740   ENCSR000EVU.FOS.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC             FOS
## 378                       ENCSR000BVG.RXRA.SKNSH   SKNSH            RXRA
## 648   ENCSR000EFA.JUN.ENDOTHELIAL_UMBILICAL_VEIN   HUVEC             JUN
## 600                    ENCSR000ECV.EP300.HELA_S3 HELA S3           EP300
## 741 6.326303e-16 3.087546 1.626461 6.705881e-13      12.173544 12.351236
## 740 7.022587e-11 3.078663 1.622304 1.063420e-08       7.973295  8.239798
## 363 1.693487e-10 2.272543 1.184308 2.243870e-08       7.649002  7.754135
## 378 4.182003e-10 2.165727 1.114852 4.029930e-08       7.394702  7.486023
## 648 6.381965e-10 2.452037 1.293980 5.637402e-08       7.248921  7.392920
## 600 9.517216e-10 2.235195 1.160400 7.419132e-08       7.129647  7.235853

To know more about the experiments included in TFEA.ChIP’s database or the conditions of a particular experiment, load the metadata table included using data(“MetaData”, package = “TFEA.ChIP”).

To summarize the results by transcription factor use rankTFs. This function performs Wilcoxon rank-sum test or GSEA to test whether ChIPs belonging to the same TF are, as a group, significantly enriched / depleted in the results of the analysis. Be aware that in the case of transcription factors whose behavior is dependent on cellular context, integrating the results of all the related ChIPs might conceal its enrichment in a particular set of experimental conditions.

TF_ranking <- rankTFs( pval_mat_UP, rankMethod = "gsea", makePlot = TRUE )
TF_ranking[[ "TFranking_plot" ]]
## Warning: Use of sub1$arg.ES is discouraged. Use arg.ES instead. ## Warning: Use of sub1$ES is discouraged. Use ES instead.
## Warning: Use of sub1$TF is discouraged. Use TF instead. ## Warning: Use of sub1$TF is discouraged. Use TF instead.
## Warning: Use of sub2$arg.ES is discouraged. Use arg.ES instead. ## Warning: Use of sub2$ES is discouraged. Use ES instead.
## Warning: Use of sub2$TF is discouraged. Use TF instead. ## Warning: Use of sub2$TF is discouraged. Use TF instead.

head( TF_ranking[[ "TF_ranking" ]] )
##          TF      ES arg.ES pVal numberOfChIPs
## GATA2 GATA2 0.83829     72 0.00             5
## FOS     FOS 0.83725     44 0.00            10
## RXRA   RXRA 0.70690      4 0.28             4
## JUN     JUN 0.65628     10 0.01            12
## EP300 EP300 0.64669     98 0.00            20
1. Plot results

The table of results generated by getCMstats can be parsed to select candidate TF. The function plot_CM uses the package plotly to generate an interactive plot representing the p-value against the odd-ratio that is very helpful to explore the results.

plot_CM( pval_mat_UP ) #plot p-values against ORs

In fact, the exploration of this graph shows a strong enrichment for several HIF datasets, as expected for the hypoxia dataset. This can be clearly shown by highlighting the datasets of interest:

HIFs <- c( "EPAS1","HIF1A","ARNT" )
names(HIFs) <- c( "EPAS1","HIF1A","ARNT" )
col <- c( "red","blue","green" )
plot_CM( pval_mat_UP, specialTF = HIFs, TF_colors = col ) #plot p-values against ORs highlighting indicated TFs

## Gene Set Enrichment Analysis.

1. Generate a sorted list of ENTREZ IDs

The GSEA analysis implemented in the TFEA.ChIP package requires as input a sorted list of genes. By default, the function preprocessInputData will sort genes according to log fold change in descending order. However, they could be sorted by any numerical parameter including p-value. If you want to generate your custom gene list with other parameters, remember to make sure the gene IDs are in Entrez Gene ID format or translate them with GeneID2Entrez

1. Select the ChIP-Seq datasets to analyze

By default, the analysis will include all the ChIP-Seq experiments available in the database. However, this analysis might take several minutes to run. To restrict the analysis to a subset of the database we can generate an index variable and pass it to the function GSEA_run. This will limit the analysis to the ChIP-Seq datasets of the user’s choosing. This index variable can be generated using the function get_chip_index and allows the user to select the whole database, the set of ChIP-Seq experiments produced by the ENCODE project (“encode”) or a specific subset of transcription factors (as a vector containing the TF names).

chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) #restrict the analysis to datasets assaying these factors
1. Run the GSEA analysis

The function GSEA_run will perform a GSEA-based analysis on the input gene list. This function is based on the R-GSEA R script bundle written by the GSEA team at the Broad Institute of MIT and Harvard 8 9. The output of the analysis depends on the variable get.RES:

• When False, the function returns a data frame storing maximum Enrichment Score and associated p-value determined for each dataset included in the analysis.

• When True, the function returns a list of three elements. The first element (Enrichment.table) is the enrichment data frame previously mentioned. The second element (RES) is a list of vectors containing the Running Enrichment Scores values (see GSEA documentation, http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results) for each of the sorted genes tested against each one of the analyzed ChIP datasets. The third element (indicators) is a list of vectors indicating if the sorted genes were bound by the factor analyzed in each ChIP dataset.

GSEA.result <- GSEA_run( hypoxia_table$Genes, hypoxia_table$log2FoldChange, chip_index, get.RES = TRUE) #run GSEA analysis
head(GSEA.result[["Enrichment.table"]])
##               Accession Cell Treatment   TF      ES p.val pval.adj Arg.ES
## 1 ENCSR155KHM.ARNT.K562 K562           ARNT 0.21499 0.139    0.139   3622
head(GSEA.result[["RES"]][["GSM2390642"]])
## NULL
head(GSEA.result[["indicators"]][["GSM2390642"]])
## NULL

The list of results can be restricted to a given set of transcription factors by setting the variable RES.filter.

1. Plotting the results

TFEA.ChIP includes two functions that use the package plotly to generate interactive html plots of your GSEA results: plot_ES and plot_RES.

1. Plot Enrichment Scores with plot_ES

We can choose to highlight ChIP-Seq from specific transcription factors plotting them in a particular color.

TF.hightlight <- c( "EPAS1","ARNT","HIF1A" )
names( TF.hightlight ) <- c( "EPAS1","ARNT","HIF1A" )
col <- c( "red","blue","green" )
plot_ES( GSEA.result, LFC = hypoxia_table$log2FoldChange, specialTF = TF.hightlight, TF_colors = col) 1. Plot Runing Enrichment Scores with plot_RES. This function will plot all the RES stored in the GSEA_run output. It is only recommended to restrict output to specific TF and/or datasets by setting the parameters TF and/or Accession respectively: plot_RES( GSEA_result = GSEA.result, LFC = hypoxia_table$log2FoldChange,
TF = c( "EPAS1" ), Accession = c(
"GSE43109.EPAS1.MACROPHAGE_HYPO_IL",
"GSE43109.EPAS1.MACROPHAGE_NORMO_IL" ) )

# Building a TF-gene binding database

If the user wants to generate their own database of ChIPseq datasets, the functions txt2gr and GR2tfbs_db automate most of the process. The required inputs are:

• A Metadata table (storing at least, Accession ID, name of the file, and TF tested in the ChIP-Seq experiment). The metadata table included with this package has the following fields: “Name”, “Accession”, “Cell”, “Cell Type”, “Treatment”, “Antibody”, and “TF”.

• A folder containing ChIP-Seq peak data, either in “.narrowpeak” format or the MACS output files “_peaks.bed" -a format that stores “chr”, “start”, “end”, “name”, and “Q-value” of every peak-.

1. Filter peaks from source and store them as a GRanges object

Specify the folder where the ChIP-Seq files are stored, create an array with the names of the ChIP-Seq files, and choose a format. Set a for loop to convert all your files to GenomicRanges objects using txt2GR. Please note that, by default, only peaks with an associated p-value of 0.05 (for narrow peaks files) or 1e-5 (for MACS files) will be kept. The user can modify the default values by setting the alpha argument to the desired threshold p-value.

folder <- "~/peak.files.folder"
File.list<-dir( folder )
format <- "macs"

gr.list <- lapply(
seq_along( File.list ),

tmp<-read.table( File.list[i], ..., stringsAsFactors = FALSE )

file.metadata <- myMetaData[ myMetaData$Name == File.list[i], ] ChIP.dataset.gr<-txt2GR(tmp, format, file.metadata) return(ChIP.dataset.gr) }, File.list = File.list, myMetadata = myMetadata, format = format ) # As an example of the output data( "ARNT.peaks.bed","ARNT.metadata", package = "TFEA.ChIP" ) # Loading example datasets for this function ARNT.gr <- txt2GR( ARNT.peaks.bed, "macs1.4", ARNT.metadata ) head( ARNT.gr, n=2 ) ## GRanges object with 2 ranges and 9 metadata columns: ## seqnames ranges strand | score mcols.Name ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr1 1813434-1814386 * | 4.60909e-06 GSM2390643_ARNT_ChIPSeq.bw ## [2] chr1 3456399-3457536 * | 8.82450e-27 GSM2390643_ARNT_ChIPSeq.bw ## mcols.Accession mcols.Cell mcols.Cell Type mcols.Treatment ## <character> <character> <character> <character> ## [1] GSM2390643 HUVEC Hypoxia ## [2] GSM2390643 HUVEC Hypoxia ## mcols.Antibody mcols.TF mcols.Score Type ## <character> <character> <character> ## [1] HIF1B ARNT corrected p-Value ## [2] HIF1B ARNT corrected p-Value ## ------- ## seqinfo: 32 sequences from an unspecified genome; no seqlengths 1. [Optional] Dnase Hypersensitive Sites By default (see step 3), TFEA.ChIP filters the ChIPseq peaks in each dataset to select those overlapping or near to (by default max. distance 10 nucleotides) DNase HS regions located within 1Kb of genes. The database of gene-associated DHSs is generated from Encode’s Master DNaseI HS as follows: 1. Load Encode’s Master DNaseI HS and convert it to a Genomic Ranges object. dnaseClusters<-read.table( file="~/path.to.file.txt", header = TRUE, sep="\t", stringsAsFactors = FALSE ) dnaseClusters<-makeGRangesFromDataFrame( dnaseClusters, ignore.strand=TRUE, seqnames.field="chrom", start.field="chromStart", end.field="chromEnd" ) 1. Select the Dnase hypersensitive sites that are 1Kb or closer to a gene and assign a gene ID to every Dnase HS that remains. library( TxDb.Hsapiens.UCSC.hg19.knownGene, quietly = TRUE ) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene Genes <- genes( txdb ) near.gene <- findOverlaps( dnaseClusters, Genes, maxgap = 1000 ) dnase.sites.list <- queryHits( near.gene ) near.gene <- subjectHits( near.gene ) gene_ids <- Genes[ near.gene ]$gene_id
DHS.database <- dnaseClusters[ dnase.sites.list ]
mcols(DHS.database)$gene_id <- gene_ids These steps can be modified to generate a new GRanges object containing custom DNAse HS sites. 1. Assign TFBS peaks from ChIP dataset to specific genes The function GR2tfbs_db assigns the TFBS peaks in the ChIP datasets stored in gr.list to a gene. To this end, a ChIP peak overlaping a Dnase HS region receive the gene label associated to that region. By default the function also assigns the gene name when the ChIP peak does not overlap a Dnase region but maps at less than 10 nucleotides from it. This behaviour can be modified by setting the argument distanceMargin to the desired value (by default distanceMargin = 10 bases). data( "DnaseHS_db", "gr.list", package="TFEA.ChIP" ) # Loading example datasets for this function TF.gene.binding.db <- GR2tfbs_db( DnaseHS_db, gr.list ) str( TF.gene.binding.db ) ## List of 1 ##$ wgEncodeEH002402:Formal class 'GeneSet' [package "GSEABase"] with 13 slots
##   .. ..@ geneIdType      :Formal class 'NullIdentifier' [package "GSEABase"] with 2 slots
##   .. .. .. ..@ type      : chr "Entrez Gene ID"
##   .. .. .. ..@ annotation: chr ""
##   .. ..@ geneIds         : chr [1:54] "6009" "6522" "6604" "100507501" ...
##   .. ..@ setName         : chr "wgEncodeEH002402"
##   .. ..@ setIdentifier   : chr "wgEncodeEH002402"
##   .. ..@ shortDescription: chr "Genes assigned to ChIP-seq"
##   .. ..@ longDescription : chr "Cell: Dnd41, Treatment: none, TF: CTCF"
##   .. ..@ organism        : chr "Homo sapiens"
##   .. ..@ pubMedIds       : chr(0)
##   .. ..@ urls            : chr(0)
##   .. ..@ contributor     : chr(0)
##   .. ..@ version         :Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. ..@ .Data:List of 1
##   .. .. .. .. ..$: int [1:3] 0 0 1 ## .. ..@ creationDate : chr(0) ## .. ..@ collectionType :Formal class 'NullCollection' [package "GSEABase"] with 1 slot ## .. .. .. ..@ type: chr "Null" The function, accepts any Genomic Range object that includes a metacolumn with a gene ID (stored in the @elementMetadata@listdata [["gene_id"]] slot of the object) for each genomic segment. For example, asignation of peaks to genes can be done by providing a list of all the genes in the genome: library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## Loading required package: GenomicFeatures ## Loading required package: AnnotationDbi data( "gr.list", package="TFEA.ChIP") # Loading example datasets for this function txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene Genes <- genes( txdb ) ## 403 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. TF.gene.binding.db <- GR2tfbs_db( Genes, gr.list, distanceMargin = 0 ) str( TF.gene.binding.db ) ## List of 1 ##$ wgEncodeEH002402:Formal class 'GeneSet' [package "GSEABase"] with 13 slots
##   .. ..@ geneIdType      :Formal class 'NullIdentifier' [package "GSEABase"] with 2 slots
##   .. .. .. ..@ type      : chr "Entrez Gene ID"
##   .. .. .. ..@ annotation: chr ""
##   .. ..@ geneIds         : chr [1:21] "9289" "84991" "3728" "25946" ...
##   .. ..@ setName         : chr "wgEncodeEH002402"
##   .. ..@ setIdentifier   : chr "wgEncodeEH002402"
##   .. ..@ shortDescription: chr "Genes assigned to ChIP-seq"
##   .. ..@ longDescription : chr "Cell: Dnd41, Treatment: none, TF: CTCF"
##   .. ..@ organism        : chr "Homo sapiens"
##   .. ..@ pubMedIds       : chr(0)
##   .. ..@ urls            : chr(0)
##   .. ..@ contributor     : chr(0)
##   .. ..@ version         :Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. ..@ .Data:List of 1
##   .. .. .. .. ..$: int [1:3] 0 0 1 ## .. ..@ creationDate : chr(0) ## .. ..@ collectionType :Formal class 'NullCollection' [package "GSEABase"] with 1 slot ## .. .. .. ..@ type: chr "Null" In this case the information about Dnase hypersensitivity is disregarded and peaks are asigned to overlapping genes (or genes closer than distanceMargin residues). 1. Generation of the TFBS database. The function makeTFBSmatrix generates a binary matrix with a row for each human gene and a column for each independent ChIPseq dataset. The cell of the matrix contain a value of 1 if the gene is bound by the TF and 0 otherwise. This matrix and the metadata table can be used instead of the default TFBS database. data( "tfbs.database", package = "TFEA.ChIP" ) # Loading example datasets for this function txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene gen.list <- genes( txdb )$gene_id # selecting all the genes in knownGene
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
myTFBSmatrix <- makeTFBSmatrix( gen.list, tfbs.database )
myTFBSmatrix[ 2530:2533, 1:3 ] # The gene HMGN4 (Entrez ID 10473) has TFBS for this three ChIP-Seq datasets
##       wgEncodeEH000645 wgEncodeEH000679 wgEncodeEH000694
## 10472                0                0                0
## 10473                1                1                1
## 10474                0                0                0
## 10475                0                0                0
1. Sustitute the default database by a custom generated table.

At the beginning of a session, use the function set_user_data to use your TFBS binary matrix and metadata table with the rest of the package.

set_user_data( binary_matrix = myTFBSmatrix, metadata = myMetaData )

1. ENCODE Project Consortium (2012) Nature 489, 57-74)

2. Edgar, R et al. (2002) Nucleic Acids Res. 30:207-10

3. Barrett, T et al. (2013) Nucleic Acids Res. 41(Database issue):D991-5

4. Subramanian, Tamayo, et al. (2005) PNAS 102, 15545-15550

5. Mootha, Lindgren, et al. (2003) Nat Genet 34, 267-273

6. Jeanne Chèneby, Marius Gheorghe, Marie Artufel, Anthony Mathelier, Benoit Ballester; ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D267–D275, https://doi.org/10.1093/nar/gkx1092

7. Tiana, M et al. The SIN3A histone deacetylase complex is required for a complete transcriptional response to hypoxia. https://doi.org/10.1101/182691

8. Subramanian, Tamayo, et al. (2005) PNAS 102, 15545-15550

9. Mootha, Lindgren, et al. (2003) Nat Genet 34, 267-273