# Contents

## 0.1 Installation

BiocManager::install("TADCompare")
# Or, for the developmental version
devtools::install_github("dozmorovlab/TADCompare")
library(dplyr)
library(TADCompare)

## 1.1 Introduction

TADCompare is a function that allows users to automatically identify differential TAD boundaries between two datasets. For each differential boundary, we provide unique classification (complex, merge, split, shifted, and strength change) defining how the TAD boundaries change between the datasets.

The only required input is two contact matrices in one of the permitted forms specified in the Input data vignette. TADCompare function will automatically determine the type of matrix and convert it to an appropriate form, given it is one of the supported formats. The only requirement is that all matrices be in the same format. For the fastest results, we suggest using $$n \times n$$ matrices. Additionally, we recommend users to provide resolution of their data. If the resolution is not provided, we estimate it using the numeric column names of contact matrices. We illustrate running TADCompare using the data from GM12878 cell line, chromosome 22, 50kb resolution (Rao et al. 2014).

# Get the rao contact matrices built into the package
data("rao_chr22_prim")
data("rao_chr22_rep")
# We see these are n x n matrices
dim(rao_chr22_prim)
## [1] 704 704
dim(rao_chr22_rep)
## [1] 704 704
# And, get a glimpse how the data looks like
rao_chr22_prim[100:105, 100:105]
##          2.1e+07 21050000 21100000 21150000 21200000 21250000
## 2.1e+07     6215     1047     1073      714      458      383
## 21050000    1047     4542     2640     1236      619      489
## 21100000    1073     2640    13468     4680     1893     1266
## 21150000     714     1236     4680    13600     4430     2124
## 21200000     458      619     1893     4430    11313     4465
## 21250000     383      489     1266     2124     4465    11824
# Running the algorithm with resolution specified
results = TADCompare(rao_chr22_prim, rao_chr22_rep, resolution = 50000)
# Repeating without specifying resolution
no_res = TADCompare(rao_chr22_prim, rao_chr22_rep)
## Estimating resolution
# We can see below that resolution can be estimated automatically if necessary
identical(results$Diff_Loci, no_res$Diff_Loci)
## [1] TRUE

## 1.3 Types of TADCompare output

TADCompare function returns a list with two data frames and one plot. The first data frame contains all regions of the genome containing a TAD boundary in at least one of the contact matrices. The results are shown below:

head(results$TAD_Frame) ## Boundary Gap_Score TAD_Score1 TAD_Score2 Differential Enriched_In ## 1 16300000 -18.6589859 0.7698503 7.704411 Differential Matrix 2 ## 2 16850000 0.2361100 7.8431724 7.580056 Non-Differential Matrix 1 ## 3 17350000 0.9683905 3.6628411 3.220253 Non-Differential Matrix 1 ## 4 18000000 -0.1033894 2.3616107 2.347392 Non-Differential Matrix 2 ## 5 18750000 3.0426660 4.7997622 3.558975 Differential Matrix 1 ## 6 18850000 2.6911660 3.7674844 2.680707 Differential Matrix 1 ## Type ## 1 <NA> ## 2 Non-Differential ## 3 Non-Differential ## 4 Non-Differential ## 5 Strength Change ## 6 Strength Change The “Boundary” column contains genomic coordinates of the given boundary. “Gap_Score” corresponds to the differential boundary score (Z-score of the difference between boundary scores). “TAD_Score1” and “TAD_Score2” corresponds to the boundary score of the two contact matrices. “Differential” simply indicates whether the boundary is differential or not differential. “Enriched_In” indicates which matrix contains the TAD boundary. “Type” identifies the type of TAD boundary change. Note: The first boundary will always be classified as “NA” since specific classification is impossible without a reference boundary to the left and right. The second data frame contains the same information as the first data frame but includes every region of the genome. We show it below: head(results$Boundary_Scores)
##   Boundary TAD_Score1 TAD_Score2     Gap_Score     Differential Enriched_In
## 1 16300000  0.7698503  7.7044114 -18.658985912     Differential    Matrix 2
## 2 16850000  7.8431724  7.5800558   0.236110024 Non-Differential    Matrix 1
## 3 16900000 -0.4402121 -0.4337948   0.009161239 Non-Differential    Matrix 1
## 4 16950000 -0.7460531 -0.5551327  -0.467725582 Non-Differential    Matrix 2
## 5 17000000 -0.6803509 -0.6512025  -0.037456910 Non-Differential    Matrix 2
## 6 17050000 -0.6626352 -0.5562997  -0.245694062 Non-Differential    Matrix 2
##               Type
## 1             <NA>
## 2 Non-Differential
## 3 Non-Differential
## 4 Non-Differential
## 5 Non-Differential
## 6 Non-Differential

Finally, we include a plot that contains a stacked barplot indicating the prevalence of each type of TAD boundary. The barplot is a ggplot2 object, making it completely customizable. We show this below:

p <- results$Count_Plot class(p) ## [1] "gg" "ggplot" plot(p) ## Warning: Removed 1 rows containing missing values (position_stack). ## 1.4 Pre-specification of TADs We recognize that users may like to use their own TAD caller for the identification of TAD boundaries prior to boundary comparison. As a result, we provide the option for pre-specification of TADs in a form of two data frames with chromosome, start, and end coordinates of TAD boundaries for the two contact matrices. With this option, only provided TAD boundaries will be tested. We provide an example below using the SpectralTAD TAD caller (Cresswell, Stansfield, and Dozmorov 2019–1AD): # Call TADs using SpectralTAD bed_coords1 = bind_rows(SpectralTAD::SpectralTAD(rao_chr22_prim, chr = "chr22", levels = 3)) ## Estimating resolution ## Warning: Unknown or uninitialised column: Group. ## Warning: Unknown or uninitialised column: Group. bed_coords2 = bind_rows(SpectralTAD(rao_chr22_rep, chr = "chr22", levels = 3)) ## Estimating resolution # Combining the TAD boundaries for both contact matrices Combined_Bed = list(bed_coords1, bed_coords2) # Printing the combined list of TAD boundaries Above, we see the dataset output by SpectralTAD contains a column named “start” and “end” indicating the start and end coordinate of each TAD. This is required, though the output of any TAD caller can be used effectively with some data manipulation. The “Level” column indicates the level of TAD in a hierarchy. # Running TADCompare with pre-specified TADs TD_Compare = TADCompare(rao_chr22_prim, rao_chr22_rep, resolution = 50000, pre_tads = Combined_Bed) # Returning the boundaries head(TD_Compare$TAD_Frame)
##   Boundary   Gap_Score TAD_Score1 TAD_Score2     Differential Enriched_In
## 1 16850000  0.23611002  7.8431724  7.5800558 Non-Differential    Matrix 1
## 2 17250000 -0.08875306  0.1103987  0.1409999 Non-Differential    Matrix 2
## 3 17600000 -0.06733403 -0.4153272 -0.3809658 Non-Differential    Matrix 2
## 4 18000000 -0.10338944  2.3616107  2.3473922 Non-Differential    Matrix 2
## 5 18350000 -0.04284963 -0.7767081 -0.7433986 Non-Differential    Matrix 2
## 6 19100000  0.01857312 -0.8269469 -0.8153997 Non-Differential    Matrix 1
##               Type
## 1      Non-Overlap
## 2      Non-Overlap
## 3      Non-Overlap
## 4 Non-Differential
## 5 Non-Differential
## 6      Non-Overlap
# Testing to show that all pre-specified boundaries are returned
table(TD_Compare$TAD_Frame$Boundary %in% bind_rows(Combined_Bed)\$end) 
##
## TRUE
##   94

Here, we see that the only boundaries that are returned are those we pre-specified.

## 1.5 Visualization of TADCompare Results

We provide means for visualizing differential TAD boundaries using the DiffPlot function. As an input, DiffPlot takes the output from the TADCompare function and the original contact matrices. As an output, it returns a ggplot2 object that can be further customized. We demonstrate visualization of the differences between GM12878 and IMR90 cell lines, on the subset of chromosome 2, 40kb resolution data processed in Schmitt et al. (Schmitt et al. 2016):

data("GM12878.40kb.raw.chr2")
data("IMR90.40kb.raw.chr2")
mtx1 <- GM12878.40kb.raw.chr2
mtx2 <- IMR90.40kb.raw.chr2
res <- 40000 # Set resolution
# Globally normalize matrices for better visual comparison (does not affect TAD calling)
mtx1_total <- sum(mtx1)
mtx2_total <- sum(mtx2)
(scaling_factor <- mtx1_total / mtx2_total)
## [1] 0.3837505
# Rescale matrices depending on which matrix is smaller
if (mtx1_total > mtx2_total) {
mtx2 <- mtx2 * scaling_factor
} else {
mtx1 <- mtx1 * (1 / scaling_factor)
}
# Coordinates of interesting regions
start_coord <- 8000000
end_coord   <- 16000000
# Another interesting region
# start_coord <- 40000000
# end_coord   <- 48000000

### 1.5.1 Comparing TAD boundary scores

# Running TADCompare as-is
TD_Compare <- TADCompare(mtx1, mtx2, resolution = res)
# Running the plotting algorithm with pre-specified TADs
cont_mat1   = mtx1,
cont_mat2   = mtx2,
resolution  = res,
start_coord = start_coord,
end_coord   = end_coord,
show_types  = TRUE,
point_size  = 5,
max_height  = 5,
rel_heights = c(1, 2),
palette     = "RdYlBu")
plot(p)

From these results, we can see that boundary scores from both matrices (“Boundary score 1” and “Boundary score 2”) frequently detected as significant boundaries in both matrices (boundary scores above the 1.5 threshold), but are non-differential (black dots). The differential boundaries correspond to the “Differential boundary score” track, where absolute boundary score differences above the 2.0 threshold are detected. The different types of differential boundaries are defined according to a schema described in the TADCompare manuscript. Note that coloring by types may be disabled by setting “show_types” to FALSE; only differential and non-differential labels will be shown.

### 1.5.2 Comparing pre-defined TAD boundaries

We can also use pre-specified TADs by providing a list of TAD coordinates containing TAD boundaries for each contact matrix. The list should be of length 2. We show how to do this below, using SpectralTAD for pre-specification:

# Call TADs using SpectralTAD
bed_coords1 = bind_rows(SpectralTAD(mtx1, chr = "chr2", levels = 3))
## Estimating resolution
bed_coords2 = bind_rows(SpectralTAD::SpectralTAD(mtx2, chr = "chr2", levels = 3))
## Estimating resolution
# Placing the data in a list for the plotting procedure
Combined_Bed = list(bed_coords1, bed_coords2)

# Running the plotting algorithm with pre-specified TADs
plot(p)