Intuitively visualizing and interpreting data from high-throughput genomic technologies continues to be challenging. “Genomic Visualizations in R” (GenVisR) attempts to alleviate this burden by providing highly customizable publication-quality graphics supporting multiple species and focused primarily on a cohort level (i.e., multiple samples/patients). GenVisR attempts to maintain a high degree of flexibility while leveraging the abilities of ggplot2 and bioconductor to achieve this goal.
Read the published Bioinformatics paper!
For the majority of users we recommend installing GenVisR from the release branch of Bioconductor, Installation instructions using this method can be found on the GenVisR landing page on Bioconductor.
Please note that GenVisR imports a few packages that have “system requirements”, in most cases these requirements will already be installed. If they are not please follow the instructions to install these packages given in the R terminal. Briefly these packages are: “libcurl4-openssl-dev” and “libxml2-dev”
Development for GenVisR occurs on the griffith lab github repository available here. For users wishing to contribute to development we recommend cloning the GenVisR repo there and submitting a pull request. Please note that development occurs on the R version that will be available at each Bioconductor release cycle. This ensures that GenVisR will be stable for each Bioconductor release but it may necessitate developers download R-devel.
We also encourage users to report bugs and suggest enhancements to GenVisR on the github issue page available here:
waterfall
provides a method of visualizing the mutational landscape of a cohort. The input to waterfall
consists of a data frame derived from either a .maf (version 2.4) file or a file in MGI annotation format (obtained from The Genome Modeling System) specified via the fileType
parameter. waterfall
will display the mutation occurrence and type in the main panel while showing the mutation burden and the percentage of samples with a mutation in the top and side sub-plots. Conflicts arising from multiple mutations in the same gene/sample cell are resolved by a hierarchical removal of mutations keeping the most deleterious as defined by the order of the “mutation type” legend. Briefly this hierarchy is as follows with the most deleterious defined first:
MAF | MGI |
---|---|
Nonsense_Mutation | nonsense |
Frame_Shift_Ins | frame_shift_del |
Frame_Shift_Del | frame_shift_ins |
Translation_Start_Site | splice_site_del |
Splice_Site | splice_site_ins |
Nonstop_Mutation | splice_site |
In_Frame_Ins | nonstop |
In_Frame_Del | in_frame_del |
Missense_Mutation | in_frame_ins |
5’Flank | missense |
3’Flank | splice_region_del |
5’UTR | splice_region_ins |
3’UTR | splice_region |
RNA | 5_prime_flanking_region |
Intron | 3_prime_flanking_region |
IGR | 3_prime_untranslated_region |
Silent | 5_prime_untranslated_region |
Targeted_Region | rna |
intronic | |
silent |
Occasionally a situation may arise in which it may be desireable to run waterfall
on an unsupported file type. This can be achieved by setting the fileType
parameter to “Custom”. Further the hieararchy of mutations (described above) must be specified with the variant_class_order
parameter which expects a character vector describing the mutations observed in order of most to least important. Note that all mutations in the input data must be specified in the variant_class_order
parameter. Using this option will require the data frame to contain the following column names: “sample”, “gene”, “variant_class”.
To view the general behavior of waterfall
we use the brcaMAF
data structure available within GenVisR. This data structure is a truncated MAF file consisting of 50 samples from the TCGA project corresponding to Breast invasive carcinoma (complete data from TCGA public web portal).
# Plot the mutation landscape
waterfall(brcaMAF, fileType="MAF")
This type of view is of limited use without expanding the graphic device given the large number of genes. Often it is beneficial to reduce the number of cells in the plot by limiting the number of genes plotted. There are three ways to accomplish this, the mainRecurCutoff
parameter accepts a numeric value between 0 and 1 and will remove genes from the data which do not have at least x proportion of samples mutated. For example if it were desireable to plot those genes with mutations in >= 6% of samples:
# Load GenVisR and set seed
library(GenVisR)
set.seed(383)
# Plot only genes with mutations in 6% or more of samples
waterfall(brcaMAF, mainRecurCutoff = 0.06)
## Error in waterfall(brcaMAF, mainRecurCutoff = 0.06): unused arguments (brcaMAF, mainRecurCutoff = 0.06)
Alternatively one can set a maximum number of genes to plot via the maxGenes
parameter which will select the top x recurrently mutated genes. In addition specific genes of interest can be displayed using the plotGenes
parameter. This parameter accepts a case insensitive character vector of genes present in the data and will subset the data on those genes. For example, if it was desirable to plot only the following genes “PIK3CA”, “TP53”, “USH2A”, “MLL3”, AND “BRCA1”:
# Plot only the specified genes
waterfall(brcaMAF, plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"))
## Error in waterfall(brcaMAF, plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", : unused arguments (brcaMAF, plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"))
It is important to note that the mutation burden sub plot does not change during these subsets, this is calculated directly from the input via the formula: \(mutations\ in\ sample/coverage\ space * 1000000\). The coverage space defaults to the size in base pairs of the “SeqCap EZ Human Exome Library v2.0”. This default can be changed via the parameter coverageSpace
. This calculation is only meant to be a rough estimate as actual coverage space can vary from sample to sample, for a more accurate calculation the user has the option to supply an optional argument via the parameter mutBurden
supplying the users own calculation of mutation burden for each sample. This should be a data frame with column names ‘sample’, ‘mut_burden’ taking the following form:
sample | mut_burden |
---|---|
TCGA-A1-A0SO-01A-22D-A099-09 | 1.5572403530013 |
TCGA-A2-A0EU-01A-22W-A071-09 | 2.19577768355127 |
TCGA-A2-A0ER-01A-21W-A050-09 | 1.89335842847617 |
TCGA-A2-A0EN-01A-13D-A099-09 | 2.67976843443599 |
TCGA-A1-A0SI-01A-11D-A142-09 | 1.64223789887094 |
TCGA-A2-A0D0-01A-11W-A019-09 | 2.9426074728573 |
TCGA-A2-A0D0-01A-11W-A019-09 | 1.49832578136762 |
TCGA-A1-A0SI-01A-11D-A142-09 | 1.55903682620951 |
TCGA-A2-A0CT-01A-31W-A071-09 | 2.61283158874499 |
TCGA-A2-A04U-01A-11D-A10Y-09 | 1.49772855192887 |
In addition to specifying the mutation burden the user also has the ability to plot additional clinical data. The clinical data supplied should be a data frame in “long” format with column names “sample”, “variable”, “value”. It is recommended to use the melt
function in the package reshape2 to coerce data into this format. Here we add clinical data to be plotted and specify a custom order and colours for these variables putting these values in two columns within the clinical plot legend:
# Create clinical data
subtype <- c("lumA", "lumB", "her2", "basal", "normal")
subtype <- sample(subtype, 50, replace = TRUE)
age <- c("20-30", "31-50", "51-60", "61+")
age <- sample(age, 50, replace = TRUE)
sample <- as.character(unique(brcaMAF$Tumor_Sample_Barcode))
clinical <- as.data.frame(cbind(sample, subtype, age))
# Melt the clinical data into 'long' format.
library(reshape2)
clinical <- melt(clinical, id.vars = c("sample"))
# Run waterfall
waterfall(brcaMAF, clinDat = clinical, clinVarCol = c(lumA = "blue4", lumB = "deepskyblue",
her2 = "hotpink2", basal = "firebrick2", normal = "green4", `20-30` = "#ddd1e7",
`31-50` = "#bba3d0", `51-60` = "#9975b9", `61+` = "#7647a2"), plotGenes = c("PIK3CA",
"TP53", "USH2A", "MLL3", "BRCA1"), clinLegCol = 2, clinVarOrder = c("lumA", "lumB",
"her2", "basal", "normal", "20-30", "31-50", "51-60", "61+"))
## Error in waterfall(brcaMAF, clinDat = clinical, clinVarCol = c(lumA = "blue4", : unused arguments (brcaMAF, clinDat = clinical, clinVarCol = c(lumA = "blue4", lumB = "deepskyblue", her2 = "hotpink2", basal = "firebrick2", normal = "green4", `20-30` = "#ddd1e7", `31-50` = "#bba3d0", `51-60` = "#9975b9", `61+` = "#7647a2"), plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"), clinLegCol = 2, clinVarOrder = c("lumA", "lumB", "her2", "basal", "normal", "20-30", "31-50", "51-60", "61+"))
Occasionally there may be samples not represented within the .maf file (due to a lack of mutations). It may still be desirable to plot these samples. To accomplish this simply add the relevant samples into the appropriate column before loading the data and leave the rest of the columns as NA. Alternatively the user can specify a list of samples to plot via the plotSamples
parameter which will accept samples not in the input data.
genCov
provides a methodology for viewing coverage information in relation to a gene track. It takes a named list of data frames with each data frame containing column names “end” and “cov” and rows corresponding to coordinates within the region of interest. Additional required arguments are a GRanges object specifying the region of interest, a BSgenome for gc content calculation, and a TxDb object containing transcription metadata (see the package Granges for more information). genCov
will plot a genomic features track and align coverage data in the list to the plot. It is recommended to use bedtools multicov to obtain coverage information for a region of interest. We demonstrate genCov
functionality using pseudo-data containing coverage information for the gene PTEN.
# Load transcript meta data
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# Load BSgenome
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
# Define a region of interest
gr <- GRanges(seqnames = c("chr10"), ranges = IRanges(start = c(89622195), end = c(89729532)),
strand = strand(c("+")))
# Create Data for input
start <- c(89622194:89729524)
end <- c(89622195:89729525)
chr <- 10
cov <- c(rnorm(1e+05, mean = 40), rnorm(7331, mean = 10))
cov_input_A <- as.data.frame(cbind(chr, start, end, cov))
start <- c(89622194:89729524)
end <- c(89622195:89729525)
chr <- 10
cov <- c(rnorm(50000, mean = 40), rnorm(7331, mean = 10), rnorm(50000, mean = 40))
cov_input_B <- as.data.frame(cbind(chr, start, end, cov))
# Define the data as a list
data <- list(`Sample A` = cov_input_A, `Sample B` = cov_input_B)
# Call genCov
genCov(data, txdb, gr, genome, gene_labelTranscriptSize = 2, transform = NULL, base = NULL)