Normally genomic signals are represented as numeric (e.g. methylation from WGBS or histone modification intensity from ChIP-seq) or binary values (e.g. existance of CpG islands in a given position), however, genomic signals sometimes can also be represented as categorical or discrete values (mostly stored as characters). A very typical example is chromatin states segmentation by ChromHMM. ChromHMM basically assigns a chromatin state (e.g. active transcription state or repressive transcription state) to a given window in the genome and the assignment of states in different windows are always mutually exclusive (which means one window can only have one state). In this vignette, We will demonstrate how to visualize the enrichment of various chromatin states around certain genomic targets and how to integerate with other epigenomic signals.

General examples

In following example, we use Roadmap dataset as the example dataset. First we show basic usages with using one sample (sample id: E003, embryonic stem cell, H1 cell line).

Load packages:

library(GenomicRanges)
library(data.table)
library(EnrichedHeatmap)
library(circlize)

The chromatin state segmentation is trained and applied from five histone modifications. Following file in use is from http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz.

We convert to a GRanges object.

states_bed = fread("zcat ~/EnrichedHeatmap_test/E003_15_coreMarks_mnemonics.bed.gz")
states = GRanges(seqnames = states_bed[[1]], 
    ranges = IRanges(states_bed[[2]] + 1, states_bed[[3]]), 
    states = states_bed[[4]])
unique(states_bed[[4]])
##  [1] "15_Quies"    "9_Het"       "14_ReprPCWk" "13_ReprPC"   "10_TssBiv"   "7_Enh"      
##  [7] "1_TssA"      "11_BivFlnk"  "2_TssAFlnk"  "5_TxWk"      "4_Tx"        "8_ZNF/Rpts" 
## [13] "6_EnhG"      "12_EnhBiv"   "3_TxFlnk"

In the 15 states, there are some states which are similar to each other such as 1_TssA (active TSS) and 2_TssAFlnk (flanking active TSS). To reduce the complexity of the analysis, we merge some of the similar states.

map = c(
    "1_TssA"      = "TssActive",
    "2_TssAFlnk"  = "TssActive",
    "3_TxFlnk"    = "Transcript",
    "4_Tx"        = "Transcript",
    "5_TxWk"      = "Transcript",
    "6_EnhG"      = "Enhancer",
    "7_Enh"       = "Enhancer",
    "8_ZNF/Rpts"  = "Heterochromatin",
    "9_Het"       = "Heterochromatin",
    "10_TssBiv"   = "TssBivalent",
    "11_BivFlnk"  = "TssBivalent",
    "12_EnhBiv"   = "Enhancer",
    "13_ReprPC"   = "Repressive",
    "14_ReprPCWk" = "Repressive",
    "15_Quies"    = "Quiescent"
)
states$states_simplified = map[states$states]

Also we set the colors for the 7 merged states.

states_col = c(
    "TssActive"       = "Red",
    "Transcript"      = "Green",
    "Enhancer"        = "Yellow",
    "Heterochromatin" = "PaleTurquoise",
    "TssBivalent"     = "Orange",
    "Repressive"      = "Grey",
    "Quiescent"       = "black"
)
states_name = names(states_col)
n_states = length(states_col)

In following, we demonstrate how to normalize the chromatin states to TSS. First we load the transcriptome and extract TSS regions. The transcriptome annotation is from http://egg2.wustl.edu/roadmap/data/byDataType/rna/annotations/gen10.long.gtf.gz and we only use protein coding genes. The database file (the sqlite file) for transcriptome was generated by GenomicFeatures package (the makeTxDbFromGFF() function).

library(GenomicFeatures)
txdb = loadDb("~/EnrichedHeatmap_test/gencode19_protein_coding_txdb.sqlite")

g = genes(txdb)
tss = promoters(g, upstream = 0, downstream = 1)

To reduce the running time, here we only use chromosome 1. Normalizing categorical signals is basically as same as numeric signals. Here we only need to specify the column name for the categorical signals.

tss_chr1 = tss[seqnames(tss) == "chr1"]
# column "states_simplified" is in character mode
mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
mat_states
## Normalize states to tss_chr1:
##   Upstream 5000 bp (50 windows)
##   Downstream 5000 bp (50 windows)
##   Include target regions (width = 1)
##   2065 target regions
##   signal is categorical (7 levels)

In the last line of the message, it clarifies it is categorical signals with 7 levels.

The implementation of normalizing categorical signals is actually very simple. Internally, n_states normalized matrices are generated where each one corresponds to one chromatin state and the value in each window is the fraction how much the window is overlapped to the state (with w0 or weighted mean mode). Since the i^th row and the j^th column in all matrices correspond to a same window to a same target (here the TSS), if there are multiple states overlap to this same window, when summarizing from all states, the state with largest overlap fraction is assigned to this window.

Since the normalizeToMatrix() is called n_states times internally, it will be a little bit slow to normalize to categorical signals.

There are some special visualizations designed for categorical signals where each state is summarized and visualized separatedly in the top annotation. Here states_col must be a named vector where the names should correspond to the levels of the categorical signals.

EnrichedHeatmap(mat_states, name = "states", col = states_col)

plot of chunk categorical_default

You might feel the row ordering is a little bit in a mess. Although the signals are categorical, internally, there are coded as integer numbers. Just similar as how factor in R is stored, each categorical level corresponds to an integer number. If the signal column is simply a character vector, the assignment of integers are based on the natural ordering of this character vector. Thus the numeric coding for signals in states object is:

data.frame(states = unique(states$states_simplified), value = 1:7)
##            states value
## 1       Quiescent     1
## 2 Heterochromatin     2
## 3      Repressive     3
## 4     TssBivalent     4
## 5        Enhancer     5
## 6       TssActive     6
## 7      Transcript     7

Zero is assigned to a window if none of the states overlap to it and it is drawn with white.

The numeric coding can be controlled by setting the signals as a factor variable and the level order of the factor controls the corresponding coding. In following code, we convert states_simplified as a factor with specifying the level order.

states$states_simplified = factor(states$states_simplified, levels = states_name)
data.frame(states = levels(states$states_simplified), value = 1:7)
##            states value
## 1       TssActive     1
## 2      Transcript     2
## 3        Enhancer     3
## 4 Heterochromatin     4
## 5     TssBivalent     5
## 6      Repressive     6
## 7       Quiescent     7

Note here the order of states_name also reflects the closeness of different states. With the order defined above, TssActive is closer to Transcript and Repressive is closer to Quiescent.

In following left plot, the TSS with active states now are far from the TSS with repressive states. In the right plot we change the level of states_simplified and you can compare to the left one.

mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
EnrichedHeatmap(mat_states, name = "states", col = states_col)
# shuffle levels for states_simplified
states$states_simplified = factor(states$states_simplified, 
    levels = states_name[c(1, 6, 2, 7, 4, 3, 5)])
mat_states = normalizeToMatrix(states, tss_chr1, value_column = "states_simplified")
EnrichedHeatmap(mat_states, name = "states", col = states_col, 
    row_title = "states_name[c(1, 6, 2, 7, 4, 3, 5)]")