This vignette will guide you through the more advanced uses of gmoviz
, such
as the incremental apporach to generating plots and
making finer modifications. It is highly recommended
that you have read the basic overview of gmoviz
before this vignette.
As well as high-level functions functions,
gmoviz
contains many lower-level functions that can be used to construct a
plot track-by-track for more flexibility.
This section will use the rBiocpkg("pasillaBamSubset")
package for example
data, so please ensure you have it installed before proceeding:
if (!require("pasillaBamSubset")) {
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("GenomicAlignments")
}
library(pasillaBamSubset)
The first step in creating a circular plot is to initialise it. This involves creating the ideogram (the rectangles that represent each sequence), which lays out the sectors for data to be plotted into. To do this, we need some ideogram data, in one of the following formats:
GRanges
, with one range for each sector you’d like to plot.data.frame
, with three columns: chr
(sector’s name), start
and
end
.For example, the following two ideogram data are equivalent:
ideogram_1 <- GRanges(seqnames = c("chrA", "chrB", "chrC"),
ranges = IRanges(start = rep(0, 3), end = rep(1000, 3)))
ideogram_2 <- data.frame(chr = c("chrA", "chrB", "chrC"),
start = rep(0, 3),
end = rep(1000, 3))
print(ideogram_1)
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chrA 0-1000 *
#> [2] chrB 0-1000 *
#> [3] chrC 0-1000 *
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengths
print(ideogram_2)
#> chr start end
#> 1 chrA 0 1000
#> 2 chrB 0 1000
#> 3 chrC 0 1000
Both of the higher level functions featureDiagram
and insertionDiagram
do
this as their first step.
Of course, typing this manually each time is troublesome. gmoviz
provides the
function getIdeogramData
which creates a GRanges
of the ideogram data
from either a .bam file, single .fasta file or a folder containing many
.fasta files.1 Note that reading in from a .bam file is significantly faster than from
a .fasta file. This function can be used as follows:
## from a .bam file
fly_ideogram <- getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4())
## from a single .fasta file
fly_ideogram_chr4_only <- getIdeogramData(
fasta_file = pasillaBamSubset::dm3_chr4())
But what if we wanted to read in just the chr3L? Luckily getIdeogramData
has
filters to select the specific sequences you want.
When reading in the ideogram data from file, there are often sequences in the
.bam file or .fasta file folder that are not necessary for the plot. Thus,
the getIdeogramData
function provides three filters to allow you to only
read in the sequences you want.2 These filters only work on the bam_file
and fasta_folder
input
methods. Using a fasta_file
means that filtering is not possible (although
you can of course edit the ideogram GRanges after it is generated).
If we want only a single chromosome/sequence, we can supply it to
wanted_chr
:
getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
wanted_chr = "chr4")
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr4 0-1351857 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Alternatively, if we want all chromosomes/sequences expect one, we can supply
it to unwanted_chr
:
getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
unwanted_chr = "chrM")
#> GRanges object with 7 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr2L 0-23011544 *
#> [2] chr2R 0-21146708 *
#> [3] chr3L 0-24543557 *
#> [4] chr3R 0-27905053 *
#> [5] chr4 0-1351857 *
#> [6] chrX 0-22422827 *
#> [7] chrYHet 0-347038 *
#> -------
#> seqinfo: 7 sequences from an unspecified genome; no seqlengths
Finally, you can supply any regex pattern to just_pattern
to create your own
custom filter:
getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
just_pattern = "R$")
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr2R 0-21146708 *
#> [2] chr3R 0-27905053 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Of course, for these filters to work the spelling of the filter must exactly match the spelling of the .fasta file names or the sequences in the .bam file.
Now that we have the ideogram data, we can initialise the graph. For this example, we will just focus on chromosome 4.
gmovizInitialise(fly_ideogram_chr4_only, track_height = 0.15)
We can see that a rectangle has been plotted and labelled to indicate chr4. Changing a few visual settings, we can create a better looking ideogram:
gmovizInitialise(fly_ideogram_chr4_only,
space_between_sectors = 25, # bigger space between start & end
start_degree = 78, # rotate the circle
sector_label_size = 1, # bigger label
track_height = 0.15, # thicker rectangle
xaxis_spacing = 30) # label every 30 degrees on the x axis
However, these small tweaks are not the only way we can enhance the appearance
of our plot. gmovizInitialise
can also display
coverage data and labels, as well as
supporting zooming and alteration of sector widths.
As demonstrated with the insertionDiagram
and featureDiagram
functions, we
can supply some coverage_data
to enhance the ideogram and change the regular
rectangles into line graphs which display the coverage (‘coverage
rectangles’). This then allows the easy identification of deletions,
duplications and other events which alter the coverage.
To do this, we must first read in the coverage information from the .bam
file. This can be done with the getCoverage
function:
chr4_coverage <- getCoverage(
regions_of_interest = "chr4",
bam_file = pasillaBamSubset::untreated3_chr4(),
window_size = 350, smoothing_window_size = 400)
Here, we get the smoothed and windowed coverage for chr4.3 See below the section on smoothing and windowing
for the effect of each of these arguments As we wanted the
coverage for the entire chr4, we could simply make
regions_of_interest = "chr4"
. However, we could also have supplied a GRanges
describing that area instead. Whichever input is used, it is really important
that the sequence names match exactly. For example, the following will t
hrow an error, because there is no sequence named “4” or “Chr4” in the .bam
file:
getCoverage(regions_of_interest = "4",
bam_file = pasillaBamSubset::untreated3_chr4(),
window_size = 300, smoothing_window_size = 400)
#> Error in getCoverage(regions_of_interest = "4", bam_file = pasillaBamSubset::untreated3_chr4(), : Make sure all of the chromsomes in regions_of_interest are in
#> the bam file and spelled exactly the same as in the bam
getCoverage(regions_of_interest = "Chr4",
bam_file = pasillaBamSubset::untreated3_chr4(),
window_size = 300, smoothing_window_size = 400)
#> Error in getCoverage(regions_of_interest = "Chr4", bam_file = pasillaBamSubset::untreated3_chr4(), : Make sure all of the chromsomes in regions_of_interest are in
#> the bam file and spelled exactly the same as in the bam
Now that we have the coverage data, we can plot the ideogram again using this
information. To draw a ‘coverage rectangle’ we need to firstly specifiy the
coverage_data
to be used (as either a GRanges or a data frame) and then also
supply to coverage_rectangle
a vector of the sector names to plot the
coverage data for.4 This means that you can have the coverage of multiple sequences/regions
in the same GRanges but choose to plot only some of them.
gmovizInitialise(ideogram_data = fly_ideogram_chr4_only,
coverage_rectangle = "chr4",
coverage_data = chr4_coverage,
xaxis_spacing = 30)
As you can see, the chr4 ideogram rectangle is replaced with a line graph showing the coverage over the entire chromosome. The coloured area represents the coverage, allowing easy identification of high and low coverage areas.
When reading in the coverage data, there are two additional parameters
window_size
and smoothing_window_size
that modify the values.
window_size
controls the window size over which coverage is calculated
(where a window size of 1 is per base coverage. A larger window size will
reduce the time taken to read in, smooth and plot the coverage. It will also
remove some of the variation in the coverage, although this is not its primary
aim. If you have more than 10-15,000 points, it is highly recommended to
use a larger window size, as this will take a long time to plot.
smoothing_window_size
controls the window used for moving average
smoothing, as carried out by the pracma package. It does not
reduce the number of points and so offers no speed improvement (in fact, it
increases the time taken to read in the coverage data). It does, however,
reduce the variation to produce a smoother, more attractive plot.
For example, try running the following:
# default window size (per base coverage)
system.time({getCoverage(regions_of_interest = "chr4",
bam_file = pasillaBamSubset::untreated3_chr4())})
# window size 100
system.time({getCoverage(regions_of_interest = "chr4",
bam_file = pasillaBamSubset::untreated3_chr4(),
window_size = 100)})
# window size 500
system.time({getCoverage(regions_of_interest = "chr4",
bam_file = pasillaBamSubset::untreated3_chr4(),
window_size = 500)})
Notice how going from the default window size of 1 (per base coverage) to a relatively modest window size of 100 dramatically reduces the time needed to read in the coverage data.
In terms of the appearance of the plot: (note: for speed, we will plot only a subset of the chromosome: from 70000-72000bp)
# without smoothing
chr4_region <- GRanges("chr4", IRanges(70000, 72000))
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
bam_file = pasillaBamSubset::untreated3_chr4())
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4",
coverage_data = chr4_region_coverage, custom_ylim = c(0,4))
# with moderate smoothing
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
bam_file = pasillaBamSubset::untreated3_chr4(),
smoothing_window_size = 10)
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4",
coverage_data = chr4_region_coverage, custom_ylim = c(0,4))
# with strong smoothing
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
bam_file = pasillaBamSubset::untreated3_chr4(),
smoothing_window_size = 75)
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4",
coverage_data = chr4_region_coverage, custom_ylim = c(0,4))