Quick and straightforward visualization of read signal over genomic intervals is key for generating hypotheses from sequencing data sets (e.g. ChIP-seq, ATAC-seq, bisulfite/methyl-seq). Many tools both inside and outside of R and Bioconductor are available to explore these types of data, and they typically start with a bigWig or BAM file and end with some representation of the signal (e.g. heatmap). profileplyr leverages many Bioconductor tools to allow for both flexibility and additional functionality in workflows that end with visualization of the read signal. Signal over select genomic intervals can be quantified with Bioconductor packages in R (e.g. soGGi, EnrichedHeatmap, genomation or via the command line (e.g. deepTools, ngs.plot). profileplyr takes the signal quantification over ranges coming from either soGGi or deepTools, as well as the metadata associated with the ranges and the input bigWig or BAM files and stores this information in a RangedSummarizedExperiment object. profileplyr then uses this object to summarize and plot the signal within these ranges, and also to group and subset these ranges based on clustering, overlap with GRanges objects, overlap with gene lists, and genomic annotations. After manipulation with established Bioconductor tools, the profileplyr object can then be either directly visualized as a customized heatmap in R using EnrichedHeatmap or exported in a format that can be visualized with deepTools. While many users prefer to use deepTools certain steps of sequencing analysis (the handling of BAM files and utilizing computeMatrix/plotHeatmap), by staying outside of R one cannot take advantage of the many Bioconductor packages used to analyze genomic range data. The profileplyr object serves as an intermediate that can be easily manipulated and annotated using Bioconductor tools, and the integration of this object with deepTools, soGGi, and EnrichedHeatmap functions enhances the user’s ability to visualize genomic signal.
There are two main ways to go from a bigWig of BAM file to a profileplyr object, which is a version of the RangedSummarizedExperiment class within the SummarizedExperiment package. A profileplyr object can be generated from the command line with the output of the deepTools ‘computeMatrix’ function, or from within R using the output of the soGGi regionPlot() function.
If you do do not currently have deepTools installed, instructions for installation are in the deepTools manual, with the easiest method likely being through Bioconda.
This direct output from the ‘computeMatrix’ function can be imported into R as a profileplyr object using the import_deepToolsMat() function. This function takes as its only argument the path to the matrix from deepTools. The information contained within this matrix is stored in a profileplyr object.
library(profileplyr)
example <- system.file("extdata",
"example_deepTools_MAT",
package = "profileplyr")
proplyrObject <- import_deepToolsMat(example)
The soGGi function plotRegion() allows for the quantification of signal over genomic intervals within R, and the information from this quantification is stored in a ChIPprofile object. profileplyr can take the ChIPprofile object as an input to generate a profileplyr object using the as_profileplyr() function. While the plotRegion function from soGGi does not allow for inputting multiple signal files or multiple BED files, profileplyr contains a function to do this, BamBigwig_to_ChIPprofile(). This function takes a character vector of paths to bigWig files or BAM files in the ‘signalFiles’ argument. Note that the files for each function call must be all BAMs or all bigWigs and not a combination of the two. The ‘testRanges’ argument is must be a character vector of paths to BED files. The corresponding names for each BED file that will appear on visualizations involving the resulting profileplyr object can be set with the ‘testRanges_names’ argument.
signalFiles <- c(system.file("extdata","Sorted_Hindbrain_day_12_1_filtered.bam",package = "profileplyr"),
system.file("extdata","Sorted_Liver_day_12_1_filtered.bam",package = "profileplyr"))
# BAMs must be indexed
require(Rsamtools)
for (i in seq_along(signalFiles)){
indexBam(signalFiles[i])
}
testRanges <- system.file("extdata",
"newranges_small.bed",
package = "profileplyr")
chipProfile <- BamBigwig_to_chipProfile(signalFiles,
testRanges,
format = "bam",
paired=FALSE,
style="percentOfRegion",
nOfWindows=20,
distanceAround=40
)
chipProfile
## class: ChIPprofile
## dim: 3 36
## metadata(2): names AlignedReadsInBam
## assays(2): '' ''
## rownames(3): giID1 giID2 giID3
## rowData names(4): name score sgGroup giID
## colnames(36): Start-1 Start-2 ... End+7 End+8
## colData names(0):
Once we have the ChIPprofile object, the as_profileplyr() function will generate the profileplyr object that can be used in the many downstream functions described below. Similar to when you import from deepTools, the grouping of the ranges after using the as_profileplyr() function is stored in the ‘rowGroupsInUse’ section of params(proplyrObject). Further, if the object is derived from a soGGi ChIPprofile object, the inherited group column will be called ‘sgGroup’.
proplyrObject <- as_profileplyr(chipProfile)
params(proplyrObject)$rowGroupsInUse
## [1] "sgGroup"
The profileplyr object is a form of the RangedSummarizedExperiment class, which allows us to both store all of the relevant information that is imported from soGGi or deepTools, and have flexibility in how we manipulate this object in downstream analysis. We will go through many of these features by looking at the object we just generated from the import functions.
proplyrObject
## class: profileplyr
## dim: 400 25
## metadata(0):
## assays(3): hindbrain1_binned liver1_binned kidney1_binned
## rownames: NULL
## rowData names(3): names score dpGroup
## colnames: NULL
## colData names(0):
The matrices that represent the bigWig or BAM signal within each bin (bin size specified within deepTools, soGGi by default uses base pair resolution, so bin = 1) are contained in a list within the profileplyr object that can be accessed with the assays() function. The columns of each matrix are the bins, while the rows are the ranges from the BED files that were used as the input to deepTools or soGGi, and the dimensions of each matrix can be seen with dim().
library(SummarizedExperiment)
assays(proplyrObject)
## List of length 3
## names(3): hindbrain1_binned liver1_binned kidney1_binned
dim(proplyrObject)
## [1] 400 25
A key feature of the RangedSummarizedExperiment class is the ability to link the rows of the assay matrices to genomic ranges. Those ranges are stored in a GRanges object allowing for efficient integration with other Bioconductor packages. The GRanges object within the profileplyr object can be accessed with the rowRanges() function. Standard GRanges accessor functions such as start(), end(), or ranges() can be used on the entire profileplyr object. Furthermore, as these ranges are annotated by the various functions of profileplyr, information and classifications of these ranges will be stored in this GRanges object in the metadata columns, which can be accessed as a Dataframe with the mcols() function (this same information is also contained in the rowData section of the object, accessed with rowData()).
rowRanges(proplyrObject)[1:3]
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | names score dpGroup
## <Rle> <IRanges> <Rle> | <character> <numeric> <factor>
## [1] chr14 34880170-34881532 * | . 0 genes
## [2] chr15 41155088-41156065 * | ._r1 0 genes
## [3] chr1 135258238-135259293 * | ._r2 0 genes
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
The information associated with each sample is stored as a Dataframe within with the parameters of the profileplyr object, and is accessed with sampleData().
sampleData(proplyrObject)
## DataFrame with 3 rows and 20 columns
## upstream downstream body bin.size unscaled.5.prime
## <numeric> <numeric> <numeric> <numeric> <numeric>
## hindbrain1_binned 0 0 1000 40 0
## liver1_binned 0 0 1000 40 0
## kidney1_binned 0 0 1000 40 0
## unscaled.3.prime sample_labels verbose bin.avg.type
## <numeric> <character> <logical> <character>
## hindbrain1_binned 0 hindbrain1_binned FALSE mean
## liver1_binned 0 liver1_binned FALSE mean
## kidney1_binned 0 kidney1_binned FALSE mean
## missing.data.as.zero scale skip.zeros nan.after.end
## <logical> <numeric> <logical> <logical>
## hindbrain1_binned FALSE 1 FALSE FALSE
## liver1_binned FALSE 1 FALSE FALSE
## kidney1_binned FALSE 1 FALSE FALSE
## proc.number sort.regions sort.using ref.point min.threshold
## <numeric> <character> <character> <list> <list>
## hindbrain1_binned 1 keep mean
## liver1_binned 1 keep mean
## kidney1_binned 1 keep mean
## max.threshold generation.method
## <list> <character>
## hindbrain1_binned deepTools
## liver1_binned deepTools
## kidney1_binned deepTools
The ‘rowGroupsInUse’ element for the params slot indicates which column of the range metadata (mcols() or rowRanges()) that will be used for grouping in the final output if that object were used for visualization. For a profileplyr object created from a deepTools computeMatrix output or from a soGGi ChIPprofile object, the inherited groups correspond to the BED files over which the signal was quantified, and these groups are contained in the ‘dpGroup’ column.
params(proplyrObject)$rowGroupsInUse
## [1] "dpGroup"
The ‘mcolToOrderBy’ slot of the profileplyr object parameters indicates which column of the range metadata will be used for ordering the ranges as they are exported to either deepTools or to EnrichedHeatmap (with the generateEnrichedHeatmap() function within profileplyr). This can be set using the orderBy() function, which requires a profileplyr object and a character string matching a column name of the range metadata as arguments. If groupBy is never used on an object, the default will be to order by the mean signal of each range (within each group). In addition, until groupBy() has been used on a profileplyr object, the ‘mcolToOrderBy’ slot of the parameters will be NULL and will not be seen with params(proplyrObject).
NOTE: If you are exporting to deepTools and want to order by something other than mean range signal (or other deepTools ordering options), then make sure to set the –sortRegions argument of deepTools ‘plotHeatmap’ to “no” so your custom ordering will be used. generateEnrichedHeatmap() will always use the column indicated by params(proplyrObject)$mcolToOrderBy, or the mean signal if that value is NULL.
params(proplyrObject)$mcolToOrderBy
## NULL
proplyrObject <- orderBy(proplyrObject, "score")
params(proplyrObject)$mcolToOrderBy
## [1] "score"
The profileplyr object can be subset either by sample, or by rows and columns of the matrix for each sample. This is done using the ‘[ ]’ brackets, with the first position being assay matrix rows, the second position being assay matrix columns, and the third position being the entire matrix for each sample (in addition to the rest of the parameters and range data). For example if you wanted to get the first ten rows and columns of each sample matrix:
proplyrObject[1:10,1:10]
## class: profileplyr
## dim: 10 10
## metadata(0):
## assays(3): hindbrain1_binned liver1_binned kidney1_binned
## rownames: NULL
## rowData names(3): names score dpGroup
## colnames: NULL
## colData names(0):
The more useful subsetting functionality is likely the ability to subset by samples using the third position of the bracket. For example, if you only wanted the first two samples to export to a deepTools matrix or for further downstream analysis, you could simply just subset with the third position:
proplyrObject[ , , 1:2]
## class: profileplyr
## dim: 400 25
## metadata(0):
## assays(2): hindbrain1_binned liver1_binned
## rownames: NULL
## rowData names(3): names score dpGroup
## colnames: NULL
## colData names(0):
It might be helpful to change the sample names to something that is shorter to make labeling of figures clearer. This can be done separately in most of the visualization packages that a profileplyr object can be used in, but you can change the names within in the sampleData it will be changed for all downstream analyses. After importing either a deepTools matrix or a soGGi object, the sample names are stored as the rownames of the sampleData Dataframe.
rownames(sampleData(proplyrObject))
## [1] "hindbrain1_binned" "liver1_binned" "kidney1_binned"
The names can simply be changed by reassigning the rownames of this Dataframe.
rownames(sampleData(proplyrObject)) <- c("Hindbrain",
"Liver",
"Kidney")
Importantly, this also changes the names of the matrices stored in the assays() section of the profileplyr object, and the names used for any output method will also be changed (e.g. deepTools, EnrichedHeatmap, ggplot)
Code involving profileplyr objects can be made clearer using the pipe operator from the magrittr package. This is aided by the fact that the profileplyr object is both the output and input of most functions within the package. This is demonstrated throughout the vignette, including the more complex visualizations later in this vignette. The user can go all the way from importing a deepTools matrix or soGGi object, through annotation and grouping, and ending with export to either the generateEnrichedHeatmap() (shown below) or export_deepToolsMat() function
The profileplyr object allows for the flexibility to be exported to various types of visualization, both inside and outside of R. This flexibility coupled with the annotation and subsetting functions described later in this vignette provide great potential for quickly navigating through data sets. There are two main ways to visualize this data once the data is stored in a profileplyr object, first as a heatmap of signal over the individual bins within these genomic ranges (e.g. deepTools and EnrichedHeatmap), and second as summarized signal (e.g. mean of the entire range) in ggplot or clustered heatmaps (e.g. pheatmap).
For heatmap visualization of the signal over the genomic intervals this object can be converted to:
Any profileplyr object can be exported as a matrix formatted for the deepTools ‘plotHeatmap’ function using the export_deepToolsMat() function. This simply requires the name of the profileplyr object and the path to the desired destination of the gzipped matrix. This function will build the metadata for the deepTools matrix based on the parameters of the profileplyr object. Importantly, the range metadata column of profileplyr object (accessed with rowRanges(object)) that is specified in the ‘rowGroupsInUse’ section of the parameters for the whole object (found using params(object)$rowGroupsInUse) is automatically used to set the grouping of the exported deepTools matrix. The ‘rowGroupsInUse’ can be changed using the groupBy() function (see more details below).
output_path <- file.path(tempdir(),"ATAC_example.MAT.gz")
export_deepToolsMat(proplyrObject, con = output_path)
To generate a heatmap within R directly from the profileplyr object, the generateEnrichedHeatmap() function should be used. This function takes a profileplyr object and produces a heatmap using the EnrichedHeatmap package. It allows for easy export from the profileplyr object and simple inclusion of the metadata as heatmap annotations. So far the profileplyr object used in this vignette is relatively basic (no groups or additional metadata), but this function generates a multipanel heatmap with a variety of arguments that have been tailored to visualizing the profileplyr object. As we explore more functionality within profileplyr, some of these features will be demonstrated.
# heatmap not printed, see below for examples of rendered heatmaps from this function
heatmap <- generateEnrichedHeatmap(proplyrObject)
class(heatmap)
## [1] "HeatmapList"
## attr(,"package")
## [1] "ComplexHeatmap"
profileplyr also facilitates direct use of the EnrichedHeatmap() function within the EnrichedHeatmap package, allowing the user further flexibility in the heatmaps that can be generated from a profileplyr object.EnrichedHeatmap uses a special type of matrix of the ‘normalizedMatrix’ class to generate heatmaps. The convertToEnrichedHeatmapMat() function within profileplyr takes a profileplyr object as the only required input, and then converts the matrices contained within the assays slot to a list of ‘normalizedMatrix’ class objects that can be used in the EnrichedHeatmap() function. See the EnrichedHeatmap vignette for detailed examples for how heatmaps can be concatenated together to visualize all of the data in one figure. Importantly, the metadata stored in the rowRanges slot of the profileplyr object can also be utilized to annotate these heatmaps. NOTE: While this function provides the most flexibility for the user to build a custom set of heatmaps, a more useful function for quick visualization of data stored in a profileplyr object is the generateEnrichedHeatmap() function, which builds these heatmaps with one command using the profileplyr object (see the previous section).
EH_mat <- convertToEnrichedHeatmapMat(proplyrObject)
EH_mat[[1]]
## Normalize Hindbrain to target:
## Upstream 0 bp (0 window)
## Downstream 0 bp (0 window)
## Include target regions (25 windows)
## 400 target regions
# the matrices of this list can be inputs for EnrichedHeatmap()
Oftentimes it will be important to use some kind of summary statistic for the entire range when trying to understand and visualize differences between samples. This is often going to be done using the summarize() function. The summarize function requires the name of a profileplyr object, the function used to summarize the ranges (e.g. rowMeans or rowMax), and the type of output. Here the basic options will be demonstrated with the example data which contains no groups, however, note how it is used later in the vignette when additional grouping options are discussed and summarizing the data in this way becomes especially useful.
If the ‘output’ argument is set to ‘matrix’, then only a matrix will be returned with a single column for each sample containing the bins summarized as indicated with the ‘fun’ argument. The row names of this matrix is a unique identifier for each range containing the chromosome, start, end, and group.
object_sumMat <- summarize(proplyrObject,
fun = rowMeans,
output = "matrix")
object_sumMat[1:3, ]
## Hindbrain Liver Kidney
## chr14_34880170_34881532_genes 0.5134429 0.04936532 0.4721223
## chr15_41155088_41156065_genes 0.3062093 0.04938936 0.2161117
## chr1_135258238_135259293_genes 0.4648443 0.37157612 0.7157744
This matrix can be used directly in other heatmap generating packages, including heatmap or pheatmap.
If the ‘output’ argument is set to ‘long’, then the output will be a long data frame that can be used for plotting with ggplot. The grouping column of the range metadata as specified by ‘params(proplyrObject)$rowGroupsInUse’ will automatically be included in the data frame. If the other range metadata columns should be included in the data frame, then the ‘keep_all_mcols’ argument should be set to TRUE. Additionally, columns specifying the range, as well as the sample and the summarized signal that correspond to that range are included by default.
proplyrObject_long <- summarize(proplyrObject,
fun = rowMeans,
output = "long")
proplyrObject_long[1:3, ]
## dpGroup combined_ranges Sample Signal
## 1 genes chr14_34880170_34881532 Hindbrain 0.5134429
## 2 genes chr15_41155088_41156065 Hindbrain 0.3062093
## 3 genes chr1_135258238_135259293 Hindbrain 0.4648443
This data frame can then be used directly with ggplot for plotting.
It is often helpful to log transform the signal to more clearly see trends in the signal that is quantified.
library(ggplot2)
ggplot(proplyrObject_long, aes(x = Sample, y = log(Signal))) +
geom_boxplot()
Lastly, if the ‘output’ argument is set to ‘object’, then a profileplyr object containing the summarized matrix will be returned. This will allow for further grouping or manipulation of the summarized ranges with other profileplyr functions, as opposed to using the binned ranges that are often used in the examples below.
proplyrObject_summ <- summarize(proplyrObject,
fun = rowMeans,
output = "object")
assays(proplyrObject_summ)[[1]][1:3, ]
## Hindbrain Liver Kidney
## [1,] 0.5134429 0.04936532 0.4721223
## [2,] 0.3062093 0.04938936 0.2161117
## [3,] 0.4648443 0.37157612 0.7157744
rowRanges(proplyrObject_summ)[1:3]
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | names score dpGroup
## <Rle> <IRanges> <Rle> | <character> <numeric> <factor>
## [1] chr14 34880170-34881532 * | . 0 genes
## [2] chr15 41155088-41156065 * | ._r1 0 genes
## [3] chr1 135258238-135259293 * | ._r2 0 genes
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
Clustering of the signal across the genomic intervals of interest can shed light on patterns that aren’t immediately apparent when looking at the data as a whole. The clusterRanges() function provides a framework to cluster ranges that are contained in a profileplyr object. It should be noted that clustering can also be performed outside of R within the deepTools ‘plotHeatmap’ function, however, profileplyr clustering allows for integration of this clustering with further grouping mechanisms and visualizations within R and profileplyr.
clusterRanges() takes a profileplyr object as its first argument as well as a function to summarize the signal in each range (similar to the summarize() function above). The pheatmap package is then used to cluster the ranges, and the type of clustering depends on whether the user inputs a value for ‘kmeans_k’ (for kmeans clustering) or ‘cutree_rows’ (for hierarchical clustering using hclust). An integer value entered for either of these arguments will specify the number of clusters used for each method. If both ‘kmeans_k’ or ‘cutree_rows’ are left as NULL (default), then a heatmap will be printed with hierarchical clustering, but no distinct clusters defined, and no profileplyr object will be returned. This might be helpful as an initial and quick look at the ranges or as a means to determine the number of clusters to try.
clusterRanges(proplyrObject,
fun = rowMeans)
# this code prints heatmap (does not return), but heatmap not shown here to save space
If an integer is entered for either ‘kmeans_k’ or ‘cutree_rows’, then a profileplyr object will be returned with a cluster assigned to each range, and a column is added to the range metadata with the cluster for each range. Further, the ‘rowGroupsInUse’ is changed to this column, meaning that if this object is exported as a deepTools matrix with the export_deepToolsMat() function, the heatmap generated by ‘plotHeatmap’ will be grouped by the clusters.
set.seed(0)
kmeans <- clusterRanges(proplyrObject,
fun = rowMeans,
kmeans_k = 4)
rowRanges(kmeans)[1:3]
## GRanges object with 3 ranges and 4 metadata columns:
## seqnames ranges strand | names score dpGroup
## <Rle> <IRanges> <Rle> | <character> <numeric> <factor>
## [1] chr14 34880170-34881532 * | . 0 genes
## [2] chr15 41155088-41156065 * | ._r1 0 genes
## [3] chr1 135258238-135259293 * | ._r2 0 genes
## cluster
## <ordered>
## [1] 1
## [2] 1
## [3] 3
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
params(kmeans)$rowGroupsInUse
## [1] "cluster"
To visually inspect the clusters, the heatmap can also be printed (though it will not be returned) if the ‘silent’ argument is set to FALSE. A profileplyr object will still be returned even if silent is set to FALSE.
Kmeans clustering:
set.seed(0)
kmeans <- clusterRanges(proplyrObject,
fun = rowMeans,
kmeans_k = 4,
silent = FALSE)
Hierarchical Clustering:
hclust <- clusterRanges(proplyrObject,
fun = rowMeans,
cutree_rows = 4,
silent = FALSE)
The profileplyr object that is the output of clusterRanges() can be exported as a deepTools matrix, which can then be used as an input for plotHeatmap.
output_path <- file.path(tempdir(),"kmeans_cluster_mat.gz")
export_deepToolsMat(kmeans, con = output_path)
This matrix can then be passed directly to ‘plotHeatmap’, either by using the terminal or by using the system() function in R.
Code from command line :
plotHeatmap -m kmeans_cluster_mat.gz -o kmeans_cluster_mat.jpg –startLabel start –endLabel end –xAxisLabel “distance (bp)”
After clustering the profileplyr object can be passed directly as an argument into the generateEnrichedHeatmap() function, and by default the heatmap will be grouped and annotated by these clusters, which were automatically set as the ‘rowGroupsInUse’ in the clusterRanges() function. Assuming that the ‘include_group_annotation’ argument of this function is set to the default value of TRUE, whichever metadata column is set to the ‘rowGroupsInUse’ slot will be used for the grouping and annotation seen below to the left of the heatmap. If there are no range groups, then the user can set this argument to be FALSE, and those color annotations will be absent.
Further, the maximum value for the y-axis in the line plots at the top of each heatmap are also automatically set based on the highest mean range signal from all of the samples. This can be set manually with the ‘ylim’ argument, or if ylim = NULL, the maximum will be inferred (i.e. be different) for each individual heatmap.
generateEnrichedHeatmap(kmeans)
An alternative way to visualize the clusters is by summarizing the signal over each range and then plotting those means by cluster. The flexibility with grouping and annotating within profileplyr makes this relatively easy. This is also a good demonstration of using piped code, going all the way from import to clustering, then to summarizing by mean range signal, and finishing with ggplot visualization:
library(magrittr)
set.seed(0)
import_deepToolsMat(example) %>%
clusterRanges(fun = rowMeans,
kmeans_k = 4,
silent = TRUE) %>%
summarize(fun = rowMeans,
output = "long") %>%
ggplot(aes(x = Sample, y = log(Signal))) +
geom_boxplot() +
facet_grid(~cluster) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_discrete(labels= c("Hindbrain", "Liver", "Kidney"))
Understanding the genes that are in the proximity to a particular set of ranges allows for potential understanding of functional consequences of signal patterns within these ranges. By annotating the ranges within a profileplyr object with genes, the user can connect signal of particular samples to specific genes and the functions attributed to those genes. profileplyr provides two methods of annotating the ranges within the object using two established Bioconductor packages, ChIPseeker and rGREAT.
The annotateRanges() function passes the ranges of a profileplyr object to the annotatePeak() function of ChIPseeker, and then integrates the information from this output into the profileplyr object. The annotation of the ranges, including the annotation type (promoter, exon, etc) and closest gene, are then compiled in the range metadata. In addition to the profileplyr object, a TxDb object must be specified with the ‘TxDb’ argument. The TxDb argument can be one of three things: