Abstract
DNA transcription is intrinsically complex. Bioinformatic work with transcription factors (TFs) is complicated by a multiplicity of data resources and annotations. The Bioconductor package TFutils includes data structures and functions to enhance the precision and utility of integrative analyses that have components involving TFs. TFutils provides catalogs of human TFs from three reference sources (CISBP, HOCOMOCO, and GO), a catalog of TF targets derived from MSigDb, and multiple approaches to enumerating TF binding sites. Aspects of integration of TF binding patterns and genome-wide association study results are explored in examples.
A central concern of genome biology is improving understanding of gene transcription. In simple terms, transcription factors (TFs) are proteins that bind to DNA, typically near gene promoter regions. The role of TFs in gene expression variation is of great interest. Progress in deciphering genetic and epigenetic processes that affect TF abundance and function will be essential in clarifying and interpreting gene expression variation patterns and their effects on phenotype. Difficulties in identifying functional binding of TFs, and opportunities for using information on TF binding in systems biology contexts, are reviewed in Lambert et al. (2018) and Weirauch et al. (2014).
This paper describes an R/Bioconductor package called TFutils, which assembles various resources intended to clarify and unify approaches to working with TF concepts in bioinformatic analysis. Computations described in this paper can be carried out with Bioconductor version 3.8. The package can be installed with
# install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TFutils")
In the next section we describe the basic concepts of enumerating and classifying TFs, enumerating TF targets, and representing genome-wide quantification of TF binding affinity. This is followed by a review of the key data structures and functions provided in the package, and an example in cancer informatics.
The present paper does not deal directly with the manipulation or interpretation of sequence motifs. An excellent Bioconductor package that synthesizes many approaches to these tasks is universalmotif.
Given the importance of the topic, it is not surprising that a number of bioinformatic research groups have published catalogs of transcription factors along with metadata about their features. Standard nomenclature for TFs has yet to be established. Gene symbols, motif sequences, and position-weight matrix catalog entries have all been used as TF identifiers.
In TFutils we have gathered information from four widely used resources, focusing specifically on human TFs: Gene Ontology (GO, Ashburner et al. (2000), in which GO:0003700 is the tag for the molecular function concept “DNA binding transcription factor activity”), CISBP (Weirauch et al. (2014)), HOCOMOCO (Kulakovskiy et al. (2018)), and the “c3 TFT (transcription factor target)” signature set of MSigDb (Subramanian et al. (2005)). Figure @ref(fig:lkupset) depicts the sizes of these catalogs, measured using counts of unique HGNC gene symbols. The enumeration for GO uses Bioconductor’s org.Hs.eg.db package to find direct associations from GO:0003700 to HGNC symbols. The enumeration for MSigDb is heuristic and involves parsing the gene set identifiers used in MSigDb for exact or close matches to HGNC symbols. For CISBP and HOCOMOCO, the associated web servers provide easily parsed tabular catalogs.
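The heuristic matching of gene set identifiers to HGNC symbols can be pictured with a small base R sketch. It is not the procedure implemented in TFutils: the identifiers are only written in the style of MSigDb c3 TFT set names, the symbol table is a toy, and the helpers `extract_tf_token` and `match_symbol` are hypothetical names introduced here. "Close match" is assumed to mean an unambiguous prefix match.

```r
# Illustrative identifiers in the style of MSigDb c3 TFT gene set names,
# and a toy symbol table; neither is taken from the package's data.
gene_set_ids <- c("V$E2F1_Q3", "V$MYC_Q2", "TGGAAA_V$NFAT_Q4_01",
                  "V$STAT3_01", "AACTTT_UNKNOWN")
hgnc_symbols <- c("E2F1", "MYC", "STAT3", "NFATC1")

# Hypothetical helper: keep the token between "V$" and the next underscore.
extract_tf_token <- function(ids) {
  has_marker <- grepl("V$", ids, fixed = TRUE)
  token <- sub("^.*V\\$", "", ids)   # drop everything up to and including "V$"
  token <- sub("_.*$", "", token)    # drop suffixes such as "_Q3"
  token[!has_marker] <- NA_character_
  token
}

# Hypothetical helper: exact match first, then a unique prefix match.
match_symbol <- function(token, symbols) {
  if (is.na(token)) return(NA_character_)
  if (token %in% symbols) return(token)
  hits <- symbols[startsWith(symbols, token)]
  if (length(hits) == 1) hits else NA_character_  # ambiguous or absent: no call
}

tokens <- extract_tf_token(gene_set_ids)
mapped <- vapply(tokens, match_symbol, character(1),
                 symbols = hgnc_symbols, USE.NAMES = FALSE)
data.frame(gene_set = gene_set_ids, token = tokens, symbol = mapped)
```

Returning no symbol for ambiguous prefixes keeps the toy mapping conservative; a real catalog would need manual review of such cases.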
As noted by Weirauch et al. (2014), interpretation of the “function and evolution of DNA sequences” is dependent on the analysis of sequence-specific DNA binding domains. These domains are dynamic and cell-type specific (Gertz et al. (2013)). Classifying TFs according to features of the binding domain is an ongoing process of increasing intricacy. Figure @ref(fig:TFclass) shows excerpts of hierarchies of terms related to TF type derived from GO (on the left) and TFclass (Wingender et al. (2018)). Our enumeration of TFs based on GO in Figure @ref(fig:lkupset) differs from the count of 1919 shown in AmiGO, because the latter includes a broader collection of receptor activities.
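One way to see how the scope of a GO query changes the TF count is to compare direct annotations with annotations that also include descendant terms. The sketch below assumes org.Hs.eg.db is installed and uses its "GO" (direct) and "GOALL" (direct plus descendant terms) key types. The helper `symbols_for` is a hypothetical name. Whether the inclusive count corresponds to the AmiGO figure is not established here, and counts will vary with the annotation release.

```r
# Assumes org.Hs.eg.db (and its dependency AnnotationDbi) are installed;
# the guard lets the script run to completion when they are not.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db

  # Hypothetical helper: unique gene symbols annotated to a GO term.
  # keytype "GO" uses direct annotations; "GOALL" adds descendant terms.
  symbols_for <- function(go_id, keytype) {
    tab <- suppressMessages(
      AnnotationDbi::select(db, keys = go_id, keytype = keytype,
                            columns = "SYMBOL")
    )
    unique(tab$SYMBOL[!is.na(tab$SYMBOL)])
  }

  direct <- symbols_for("GO:0003700", "GO")
  with_descendants <- symbols_for("GO:0003700", "GOALL")

  # Number of unique symbols under each query scope
  c(direct = length(direct), with_descendants = length(with_descendants))
}
```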