Last edited: 19 May 2020

1 Introduction

To make sense of genomic data, one often aims to plot such data in a genome browser, along with a variety of genomic annotation features, such as gene or transcript models, CpG islands, repeat regions, and so on. These features may either be extracted from public databases like ENSEMBL or UCSC, or they may be generated or curated in-house. Many of the currently available genome browsers do a reasonable job of displaying genome annotation data, and there are options to connect to some of them from within R (e.g., using the rtracklayer package). However, none of these solutions offers the flexibility of the full R graphics system to display large numeric data in a multitude of different ways. The Gviz package (Hahne and Ivanek 2016) aims to close this gap by providing a structured visualization framework to plot any type of data along genomic coordinates. It is loosely based on the GenomeGraphs package by Steffen Durinck and James Bullard; however, the complete class hierarchy as well as all the plotting methods have been restructured in order to increase performance and flexibility. All plotting is done using the grid graphics system, and several specialized annotation classes allow the integration of publicly available genomic annotation data from sources like UCSC or ENSEMBL.

2 Basic Features

The fundamental concept behind the Gviz package is similar to the approach taken by most genome browsers, in that individual types of genomic features or data are represented by separate tracks. Within the package, each track constitutes a single object inheriting from class GdObject, and there are constructor functions as well as a broad range of methods to interact with and to plot these tracks. When combining multiple objects, the individual tracks will always share the same genomic coordinate system, thus taking the burden of aligning the plot elements from the user.

It is worth mentioning that tracks in the sense of the Gviz package are only defined for a single chromosome on a specific genome, at least for the duration of a given plotting operation. You will later see that a track may still contain information for multiple chromosomes; however, most of this is hidden except for the currently active chromosome, and the user will have to explicitly switch the chromosome to access the inactive parts. While the package in principle imposes no fixed structure on the chromosome or on the genome names, it makes sense to stick to a standardized naming paradigm, in particular when fetching additional annotation information from online resources. By default this is enforced by a global option ucscChromosomeNames, which is set during package loading and which causes the package to check all supplied chromosome names for validity in the sense of the UCSC definition (chromosome names have to start with the chr string). You may decide to turn this feature off by calling options(ucscChromosomeNames=FALSE). For the remainder of this vignette, however, we will make use of the UCSC genome and chromosome identifiers, e.g., the chr7 chromosome on the mouse mm9 genome.
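The notion of an active chromosome and the effect of the ucscChromosomeNames option can be illustrated with a small sketch. The coordinates, the track names and the chromosome name "7" below are made up for illustration, and the example assumes that a track built from a GRanges object without an explicit chromosome argument starts out on the first chromosome found in the ranges.

```r
library(Gviz)
library(GenomicRanges)

# Hypothetical features on two chromosomes (coordinates are made up)
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges = IRanges(start = c(100, 500, 300), width = 50))
mtrack <- AnnotationTrack(gr, genome = "hg19", name = "multi")

# Only one chromosome is active at a time
chromosome(mtrack)

# Explicitly switch to reach the features on the other chromosome
chromosome(mtrack) <- "chr2"
chromosome(mtrack)

# Turn off the UCSC name check to allow names without the "chr" prefix,
# then restore the previous setting
old <- options(ucscChromosomeNames = FALSE)
ntrack <- AnnotationTrack(start = 100, width = 50, chromosome = "7",
                          genome = "hg19", name = "non-UCSC")
chromosome(ntrack)
options(old)
```

Restoring the option afterwards keeps the rest of a session consistent with the UCSC naming used throughout this vignette.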

The different track classes will be described in more detail in Section 4 further below. For now, let’s just take a look at a typical Gviz session to get an idea of what this is all about. We begin our presentation of the available functionality by loading the package:

library(Gviz)

The simplest genomic features consist of start and stop coordinates, possibly overlapping each other. CpG islands or microarray probes are real-life examples of this class of features. In the Bioconductor world those are most often represented as run-length encoded vectors, for instance in the IRanges and GRanges classes. To seamlessly integrate with other Bioconductor packages, we can use the same data structures to generate our track objects. A sample set of CpG island coordinates has been saved in the cpgIslands object and we can use that for our first annotation track object. The constructor function AnnotationTrack is a convenient helper to create the object.

library(GenomicRanges)
data(cpgIslands)
class(cpgIslands)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
chr <- as.character(unique(seqnames(cpgIslands)))
gen <- genome(cpgIslands)
atrack <- AnnotationTrack(cpgIslands, name = "CpG")

Please note that the AnnotationTrack constructor (like most constructors in this package) is fairly flexible and can accommodate many different types of inputs. For instance, the start and end coordinates of the annotation features could be passed in as individual arguments start and end, as a data.frame or even as an IRanges or GRangesList object. Furthermore, a number of coercion methods are available for those package users who prefer the more traditional R coding paradigm, and they should allow operations along the lines of as(obj, 'AnnotationTrack'). You may want to consult the class’ manual page for more information, or take a look at Section 8 for a listing of the most common data structures and their respective counterparts in the Gviz package.
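As a rough sketch of these alternative inputs, the same three features can be supplied in three different ways. The coordinates and track names are invented for illustration, and the data.frame is assumed to need only start and end columns when the chromosome is given as a separate argument.

```r
library(Gviz)
library(GenomicRanges)

# 1) Individual coordinate arguments (hypothetical coordinates)
t1 <- AnnotationTrack(start = c(100, 400, 900), end = c(149, 449, 949),
                      chromosome = "chr7", genome = "hg19", name = "args")

# 2) A data.frame holding the start and end columns
df <- data.frame(start = c(100, 400, 900), end = c(149, 449, 949))
t2 <- AnnotationTrack(range = df, chromosome = "chr7", genome = "hg19",
                      name = "data.frame")

# 3) Coercion from an existing GRanges object
gr <- GRanges("chr7", IRanges(start = c(100, 400, 900), width = 50))
genome(gr) <- "hg19"
t3 <- as(gr, "AnnotationTrack")
```

All three objects describe the same ranges, so which form to use is mostly a matter of what the data already looks like.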

With our first track object created, we may now proceed to the plotting. There is a single function plotTracks that handles all of this. As we will learn in the remainder of this vignette, plotTracks is quite powerful and has a number of very useful additional arguments. For now we will keep things very simple and just plot the single CpG islands annotation track.

plotTracks(atrack)

As you can see, the resulting graph is not particularly spectacular. There is a title region showing the track’s name on a gray background on the left side of the plot and a data region showing the seven individual CpG islands on the right. This structure is similar for all the available track object classes and it somewhat mimics the layout of the popular UCSC Genome Browser. If you are not happy with the default settings, the Gviz package offers a multitude of options to fine-tune the track appearance, which will be shown in Section 3.

Apart from the relative distance of the CpG islands, this visualization does not tell us much. One obvious next step would be to indicate the genomic coordinates we are currently looking at in order to provide some reference. For this purpose, the Gviz package offers the GenomeAxisTrack class. Objects from the class can be created using the constructor function of the same name.

gtrack <- GenomeAxisTrack()

Since a GenomeAxisTrack object is always relative to the other tracks that are plotted, there is little need for additional arguments. Essentially, the object just tells the plotTracks function to add a genomic axis to the plot. Nonetheless, it represents a separate annotation track just as the CpG island track does. We can pass this additional track on to plotTracks in the form of a list.

plotTracks(list(gtrack, atrack))

You may have realized that the genomic axis does not take up half of the available vertical plotting space, but only uses the space necessary to fit the axis and labels. Also, the title region for this track is empty. In general, the Gviz package tries to find reasonable defaults for all the parameters controlling the look and feel of a plot so that appealing visualizations can be created without too much tinkering. However, all features on the plot, including the relative track sizes, can also be adjusted manually.
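One way to adjust the relative track sizes by hand is the sizes argument of plotTracks, which takes one relative height per track. The following is a minimal sketch; the 1:3 ratio is an arbitrary choice for illustration.

```r
library(Gviz)

data(cpgIslands)
atrack <- AnnotationTrack(cpgIslands, name = "CpG")
gtrack <- GenomeAxisTrack()

# Relative vertical sizes, one value per track in the list:
# here the annotation track gets three times the height of the axis
plotTracks(list(gtrack, atrack), sizes = c(1, 3))
```

Leaving sizes unset returns to the automatically determined heights described above.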

As mentioned at the beginning of this vignette, a plotted track is always defined for exactly one chromosome on a particular genome. We can include this information in our plot by means of a chromosome ideogram. An ideogram is a simplified visual representation of a chromosome, with the different chromosomal staining bands indicated by color, and the centromere (if present) indicated by the shape. The necessary information to produce this visualization is stored in online data repositories, for instance at UCSC. The Gviz package offers very convenient connections to some of these repositories, and the IdeogramTrack constructor function is one example of such a connection. With just the information about a valid UCSC genome and chromosome, we can directly fetch the chromosome ideogram information and construct a dedicated track object that can be visualized by plotTracks. Please note that you will need an established internet connection for this to work, and that fetching data from UCSC can take quite a long time, depending on the server load. The Gviz package tries to cache as much data as possible to reduce the bandwidth in future queries.

itrack <- IdeogramTrack(genome = gen, chromosome = chr)

Similar to the previous examples, we stick the additional track object into a list in order to plot it.

plotTracks(list(itrack, gtrack, atrack))

Ideogram tracks are the one exception among all of Gviz’s track objects in the sense that they are not really displayed on the same coordinate system as all the other tracks. Instead, the current genomic location is indicated on the chromosome by a red box (or, as in this case, a red line if the width is too small to fit a box).

So far we have only looked at very basic annotation features and how to give a point of reference to our plots. Naturally, we also want to be able to handle more complex genomic features, such as gene models. One potential use case would be to utilize gene model information from an existing local source. Alternatively, we could download such data from one of the available online resources like UCSC or ENSEMBL, and there are constructor functions to handle these tasks. For this example we are going to load gene model data from a stored data.frame. The track class of choice here is a GeneRegionTrack object, which can be created via the constructor function of the same name. Similar to the AnnotationTrack constructor, there are multiple possible ways to pass in the data.

data(geneModels)
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "Gene Model")
plotTracks(list(itrack, gtrack, atrack, grtrack))

In all the previous examples the plotted genomic range has been determined automatically from the input tracks. Unless told otherwise, the package will always display the region from the leftmost item to the rightmost item in any of the tracks. Of course such a static view of a chromosomal region is of rather limited use. We often want to zoom in or out on a particular plotting region to see more details or to get a broader overview. To that end, plotTracks supports the from and to arguments that let us choose an arbitrary genomic range to plot.

plotTracks(list(itrack, gtrack, atrack, grtrack),
           from = 26700000, to = 26750000)

Another pair of arguments that control the zoom state are extend.left and extend.right. Unlike from and to, these arguments are relative to the currently displayed ranges, and can be used to quickly extend the view on one or both ends of the plot. In addition to positive or negative absolute integer values, one can also provide a float value between -1 and 1 which will be interpreted as a zoom factor, i.e., a value of 0.5 will cause zooming in to half the currently displayed range.

plotTracks(list(itrack, gtrack, atrack, grtrack),
           extend.left = 0.5, extend.right = 1000000)

You may have noticed that the layout of the gene model track has changed depending on the zoom level. This is a feature of the Gviz package, which automatically tries to find the optimal visualization settings to make best use of the available space. At the same time, when features on a track are too close together to be plotted as separate items with the current device resolution, the package will try to reasonably merge them in order to avoid overplotting.

Often individual ranges on a plot tend to grow quite narrow, in particular when zooming far out, and a couple of tweaks become helpful in order to get nice plots, for instance to drop the bounding borders of the exons.

plotTracks(list(itrack, gtrack, atrack, grtrack), 
           extend.left = 0.5, extend.right = 1000000, col = NULL)

When zooming further in, it may become interesting to take a look at the actual genomic sequence at a given position, and the Gviz package provides the track class SequenceTrack that lets you do just that. Among several other options, it can draw the necessary sequence information from one of the BSgenome packages.

library(BSgenome.Hsapiens.UCSC.hg19)
strack <- SequenceTrack(Hsapiens, chromosome = chr)
plotTracks(list(itrack, gtrack, atrack, grtrack, strack), 
           from = 26591822, to = 26591852, cex = 0.8)

So far we have replicated the features of many other genome browser tools out there. The real power of the package comes with a rather general track type, the DataTrack. DataTrack objects are essentially run-length encoded numeric vectors or matrices, and we can use them to add all sorts of numeric data to our genomic coordinate plots. There are many different visualization options for these tracks, from dot plots to histograms to box-and-whisker plots. The individual rows in a numeric matrix are considered to be different data groups or samples, and the columns are the raster intervals in the genomic coordinates. Of course, the data points (or rather the data ranges) do not have to be evenly spaced; each column is associated with a particular genomic location. For demonstration purposes we can create a simple DataTrack object from randomly sampled data.

set.seed(255)
lim <- c(26700000, 26750000)
coords <- sort(c(lim[1], 
                 sample(seq(from = lim[1], to = lim[2]), 99), 
                 lim[2]))
dat <- runif(100, min = -10, max = 10)
dtrack <- DataTrack(data = dat, start = coords[-length(coords)],
                    end = coords[-1], chromosome = chr, genome = gen, 
                    name = "Uniform")
plotTracks(list(itrack, gtrack, atrack, grtrack, dtrack), 
           from = lim[1], to = lim[2])
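
The matrix form described above, with rows as samples and columns as genomic intervals, can be sketched in the same way. The three samples are random values invented for illustration, the chromosome and genome are written out as chr7 and hg19 on the assumption that they match the example data, and the box-and-whisker display is selected with the type argument.

```r
library(Gviz)

set.seed(255)
lim <- c(26700000, 26750000)
coords <- sort(c(lim[1],
                 sample(seq(from = lim[1], to = lim[2]), 99),
                 lim[2]))

# Three hypothetical samples (rows) measured over 100 intervals (columns)
mat <- matrix(runif(300, min = -10, max = 10), nrow = 3)

mtrack <- DataTrack(data = mat, start = coords[-length(coords)],
                    end = coords[-1], chromosome = "chr7", genome = "hg19",
                    name = "Samples")

# Summarize the samples at each position as box-and-whisker plots
plotTracks(mtrack, from = lim[1], to = lim[2], type = "boxplot")
```

With a single-row input the same call would have only one value per interval to summarize, so plot types such as boxplot are mainly useful with several rows.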