The R package decompTumor2Sig has been developed to decompose individual tumor genomes (i.e., the lists of somatic single nucleotide variants identified in individual tumors) according to a set of given mutational signatures—a problem termed signature refitting—using a quadratic-programming approach.
Mutational signatures can be either of the form initially proposed by Alexandrov et al. (Cell Rep. 3:246–259, 2013 and Nature 500:415–421, 2013)—in the following called “Alexandrov signatures”—or of the simplified form proposed by Shiraishi et al. (PLoS Genet. 11:e1005657, 2015)—in the following called “Shiraishi signatures”.
For each of the given mutational signatures, decompTumor2Sig determines their contribution to the load of somatic mutations observed in a tumor.
Please read the following important notes first:
Note: here and in the following, when referring to “mutations”, we intend single nucleotide variants (SNVs).
Note: given the number of parameters to be estimated, decompTumor2Sig should best be applied only to tumor samples with a sufficient number of somatic mutations. From our experience, using tumor samples with 100+ somatic mutations was reasonable when using version 2 of the COSMIC Mutational Signatures (30 signatures). Tumor samples with fewer somatic mutations should be avoided unless a tumor-type specific subset of signatures is used.
Note: be aware that publicly available signature sets have often been defined with respect to genome wide mutation frequencies, so they should best be applied to somatic mutation data from whole genome sequencing. Background frequencies may be different for subsets of the genome, i.e., current signature sets might yield incorrect results when applied to, for example, mutation data from targetted sequencing of only a subset of genes!
Note: for all functions provided by decompTumor2Sig, please see the manual or the inline R help page for further details and explanations.
Krüger S, Piro RM (2019) decompTumor2Sig: Identification of mutational signatures active in individual tumors. BMC Bioinformatics 20(Suppl 4):152.
Krüger S, Piro RM (2017) Identification of mutational signatures active in individual tumors. PeerJ Preprints 5:e3257v1 (Proceedings of the NETTAB 2017 Workshop on Methods, Tools & Platforms for Personalized Medicine in the Big Data Era, October 16-18, 2017 in Palermo, Italy).
BibTeX:
@UNPUBLISHED(krueger-decompTumor2Sig-paper,
author = "Kr{\"u}ger, Sandra and Piro, Rosario Michael",
title = "decompTumor2Sig: Identification of mutational signatures active in individual tumors",
journal = "BMC Bioinformatics",
volume = "20",
number = "Suppl 4",
pages = "152",
year = 2019
);
@ARTICLE(krueger-decompTumor2Sig-nettab,
author = "Kr{\"u}ger, Sandra and Piro, Rosario Michael",
title = "Identification of mutational signatures active in individual tumors",
journal = "PeerJ Preprints",
volume = "5",
pages = "e3257v1",
year = 2017
);
decompTumor2Sig requires several CRAN and Bioconductor R packages to be installed. Dependencies are usually handled automatically, when installing the package using the following commands:
install.packages("BiocManager")
BiocManager::install("decompTumor2Sig")
[NOTE: Ignore the first line if you already have installed the BiocManager.]
In the unlikely case that a manual installation is required, e.g., if you do not install decompTumor2Sig via Bioconductor (which is highly recommended), before installing decompTumor2Sig make sure the following packages are installed:
CRAN packages:
Matrix, quadprog, plyr, ggplot2, ggseqlogo, gridExtra
Bioconductor packages:
GenomicRanges, GenomicFeatures, GenomeInfoDb, Biostrings, BiocGenerics, S4Vectors, BSgenome.Hsapiens.UCSC.hg19, TxDb.Hsapiens.UCSC.hg19.knownGene, VariantAnnotation, SummarizedExperiment
If you intend to work with Shiraishi-type signatures, you may also want to install the R package pmsignature which is neither part of CRAN nor of Bioconductor and must be downloaded and installed manually (available at: https://github.com/friend1ws/pmsignature).
CRAN packages can be installed from R using the following command:
install.packages("<package_name>")
Bioconductor packages can be installed from R using the following command:
BiocManager::install("<package_name>")
Sometimes, it may also be useful to update Bioconductor:
BiocManager::install()
Finally, the manual installation of decompTumor2Sig can, for example, be done from the command line …
R CMD INSTALL decompTumor2Sig_<version>.tar.gz
… or the newest version can directly be installed from GitHub using the CRAN package devtools:
library(devtools)
install_github("rmpiro/decompTumor2Sig")
After installation, loading the package is simple:
library(decompTumor2Sig)
decompTumor2Sig works with two kinds of input data:
Additionally, decompTumor2Sig requires the genomic sequence (in form of a BSgenome object) to determine neighboring nucleotides of the mutated bases. It may also require transcript annotations (in form of a TxDb object) in case the given mutational signatures take information on the transcription direction into account.
Mutational signatures can be read in two different formats: Alexandrov-type signatures and Shiraishi-type signatures.
Alexandrov signatures are specified either in the format used at the COSMIC Mutational Signatures website for signatures version 2 (March 2015, see http://cancer.sanger.ac.uk/cosmic/signatures_v2 -> “Download signatures”), in the format used for signatures version 3 (May 2019; see https://www.synapse.org/#!Synapse:syn12009743), or in the format used for singatures version 3.2 (March 2021; see https://cancer.sanger.ac.uk/cosmic/signatures/SBS/). Also the signatures of version 3.1 (same as version 3.0 but distributed as an Excel spread sheet on the COSMIC Mutational Signatures website) can be read. For versions 3, 3.1 and 3.2, only Single Base Substitution (SBS) signatures can be used.
The standard Alexandrov-type signatures report mutation frequencies for nucleotide triplets where the mutated base is at the center. Also, the basic Alexandrov signatures do not take transcription direction into account when computing mutation frequencies.
Example for version 2:
Substitution Type Trinucleotide Somatic Mutation Type Signature 1 ...
C>A ACA A[C>A]A 0.011098326166 ...
C>A ACC A[C>A]C 0.009149340734 ...
C>A ACG A[C>A]G 0.001490070468 ...
C>A ACT A[C>A]T 0.006233885236 ...
...
T>G TTG T[T>G]G 0.002031076880 ...
T>G TTT T[T>G]T 0.004030128160 ...
Example for version 3:
Type,SubType,SBS1,SBS2,SBS3,SBS4,SBS5,SBS6, ...
C>A,ACA,8.86E-04,5.80E-07,2.08E-02,4.22E-02,1.20E-02,4.25E-04, ...
C>A,ACC,2.28E-03,1.48E-04,1.65E-02,3.33E-02,9.44E-03,5.24E-04, ...
C>A,ACG,1.77E-04,5.23E-05,1.75E-03,1.56E-02,1.85E-03,5.20E-05, ...
C>A,ACT,1.28E-03,9.78E-05,1.22E-02,2.95E-02,6.61E-03,1.80E-04, ...
...
T>G,TTG,5.83E-04,9.54E-05,8.05E-03,2.32E-03,6.94E-03,3.24E-04, ...
T>G,TTT,2.23E-16,2.23E-16,1.05E-02,5.68E-04,1.35E-02,1.01E-03, ...
(Apart from the change of tab- to comma-separator, the main difference is the lack of the redundant 3rd annotation column in version 3. Version 3.1 has the same format as version 3, but is distributed as an Excel spread sheet.)
Example for version 3.2:
Type SBS1 SBS2 ...
A[C>A]A 0.000876022860356065 5.79005909129116e-07 ...
A[C>A]C 0.0022201195808427 0.000145504532003707 ...
A[C>A]G 0.000179727205679816 5.36186073164881e-05 ...
A[C>A]T 0.00126505263784183 9.75912243380865e-05 ...
...
T[T>G]G 0.000577613846957072 9.54312693870027e-05 ...
T[T>G]T 2.20126302870092e-16 2.22251767921597e-16 ...
To read Alexandrov-type signatures, use the command readAlexandrovSignatures(). By default, the command reads the version 2 (!) signatures directly from COSMIC and stores them in a list object:
signatures <- readAlexandrovSignatures()
length(signatures)
## [1] 30
signatures[1]
## $Signature.1
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C
## 1.109833e-02 9.149341e-03 1.490070e-03 6.233885e-03 6.595870e-03 7.342368e-03
## C[C>A]G C[C>A]T G[C>A]A G[C>A]C G[C>A]G G[C>A]T
## 8.928404e-04 7.186582e-03 8.232604e-03 5.758021e-03 6.163352e-04 4.459080e-03
## T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C
## 1.225006e-02 1.116223e-02 2.275496e-03 1.525910e-02 1.801068e-03 2.580909e-03
## A[C>G]G A[C>G]T C[C>G]A C[C>G]C C[C>G]G C[C>G]T
## 5.925480e-04 2.963986e-03 1.284983e-03 7.021348e-04 5.062896e-04 1.381543e-03
## G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C
## 6.021227e-04 2.393352e-03 2.485340e-07 8.900807e-04 1.874853e-03 2.067419e-03
## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T
## 3.048970e-04 3.151574e-03 2.951453e-02 1.432275e-02 1.716469e-01 1.262376e-02
## C[C>T]A C[C>T]C C[C>T]G C[C>T]T G[C>T]A G[C>T]C
## 2.089645e-02 1.850170e-02 9.557722e-02 1.711331e-02 2.494381e-02 2.716149e-02
## G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T
## 1.035708e-01 1.768985e-02 1.449210e-02 1.768078e-02 7.600222e-02 1.376170e-02
## A[T>A]A A[T>A]C A[T>A]G A[T>A]T C[T>A]A C[T>A]C
## 4.021520e-03 2.371144e-03 2.810910e-03 8.360909e-03 1.182587e-03 1.903167e-03
## C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T
## 1.487961e-03 2.179344e-03 6.892894e-04 5.524095e-04 1.200229e-03 2.107137e-03
## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C
## 5.600155e-03 1.999079e-03 1.090066e-03 3.981023e-03 1.391577e-02 6.274961e-03
## A[T>C]G A[T>C]T C[T>C]A C[T>C]C C[T>C]G C[T>C]T
## 1.013764e-02 9.256316e-03 4.176675e-03 5.252593e-03 7.013225e-03 6.713813e-03
## G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C
## 1.124784e-02 6.999724e-03 4.977593e-03 1.066741e-02 8.073616e-03 4.857381e-03
## T[T>C]G T[T>C]T A[T>G]A A[T>G]C A[T>G]G A[T>G]T
## 8.325454e-03 6.257106e-03 1.587636e-03 1.784091e-03 1.385831e-03 3.158539e-03
## C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C
## 3.026912e-04 2.098502e-03 1.599549e-03 2.758538e-03 9.904500e-05 2.023656e-04
## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T
## 1.188353e-03 8.007233e-04 1.397554e-03 1.291737e-03 2.031077e-03 4.030128e-03
Alternatively, the signatures can be read from a file of the format shown above (TSV, CSV or Excel spread sheet):
signatures <- readAlexandrovSignatures(file="<signature_file>")
Shiraishi signatures are specified as matrices (in flat files without headers and row names; one file per signature).
Format (see Shiraishi et al. PLoS Genetics 11(12):e1005657, 2015):
The first row: Frequencies of the six possible base changes C>A, C>G, C>T, T>A, T>C, and T>G. Please note that due to the complementarity of base pairing these six base changes already include A>? and G>?.
The following 2k rows (for k up- and downstream flanking bases): Frequencies of the bases A, C, G, and T, followed by two 0 values.
The optional last row (only if transcription direction is considered): Frequencies of occurrences on the transcription strand, and on the opposite strand, followed by four 0 values.
Example:
0.05606328 0.07038910 0.39331059 0.13103780 0.20797476 0.14122446
0.27758477 0.21075424 0.23543226 0.27622874 0 0
0.33081021 0.25347427 0.23662536 0.17909016 0 0
0.21472304 2.6656e-09 0.55231053 0.23296643 0 0
0.22640542 0.20024237 0.32113377 0.25221844 0 0
0.50140066 0.49859934 0 0 0 0
Among its examples, the decompTumor2Sig package provides a small set of four Shiraishi-type signatures in flat files. These signatures were obtained from 21 breast cancer genomes (Nik-Zainal et al. Cell 149:979–993, 2012) using the R package pmsignature (Shiraishi et al. PLoS Genet. 11:e1005657, 2015).
To read these flat files as signatures, you can use the following example:
# take the example signature flat files provided with decompTumor2Sig
sigfiles <- system.file("extdata",
paste0("Nik-Zainal_PMID_22608084-pmsignature-sig",1:4,".tsv"),
package="decompTumor2Sig")
# read the signature flat files
signatures <- readShiraishiSignatures(files=sigfiles)
signatures[1]
## $`Nik-Zainal_PMID_22608084-pmsignature-sig1.tsv`
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.05606328 7.038910e-02 0.3933106 0.1310378 0.2079748 0.1412245
## -2 0.27758477 2.107542e-01 0.2354323 0.2762287 0.0000000 0.0000000
## -1 0.33081021 2.534743e-01 0.2366254 0.1790902 0.0000000 0.0000000
## +1 0.21472304 2.665627e-09 0.5523105 0.2329664 0.0000000 0.0000000
## +2 0.22640542 2.002424e-01 0.3211338 0.2522184 0.0000000 0.0000000
## tr 0.50140066 4.985993e-01 0.0000000 0.0000000 0.0000000 0.0000000
The third possibility is to convert Shiraishi-type signatures directly from the package that computes them (pmsignature; Shiraishi et al. PLoS Genet. 11:e1005657, 2015).
Example workflow:
# load example signatures for breast cancer genomes from Nik-Zainal et al
# (PMID: 22608084) in the format produced by pmsignature (PMID: 26630308)
pmsigdata <- system.file("extdata",
"Nik-Zainal_PMID_22608084-pmsignature-Param.Rdata",
package="decompTumor2Sig")
load(pmsigdata)
# extract the signatures from the pmsignature 'Param' object
signatures <- getSignaturesFromEstParam(Param)
signatures[1]
## [[1]]
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.05606328 7.038910e-02 0.3933106 0.1310378 0.2079748 0.1412245
## -2 0.27758477 2.107542e-01 0.2354323 0.2762287 0.0000000 0.0000000
## -1 0.33081021 2.534743e-01 0.2366254 0.1790902 0.0000000 0.0000000
## +1 0.21472304 2.665627e-09 0.5523105 0.2329664 0.0000000 0.0000000
## +2 0.22640542 2.002424e-01 0.3211338 0.2522184 0.0000000 0.0000000
## tr 0.50140066 4.985993e-01 0.0000000 0.0000000 0.0000000 0.0000000
Please note that pmsignature is neither part of CRAN nor of Bioconductor and must be downloaded and installed manually (available at: https://github.com/friend1ws/pmsignature). To load mutational signatures without pmsignature being installed, see the previous sections.
An Alexandrov-type signature can be converted into a Shiraishi-type signature (but not vice versa due to the loss of information). Consider the following example:
sign_a <- readAlexandrovSignatures()
sign_s <- convertAlexandrov2Shiraishi(sign_a)
sign_s[1]
## $Signature.1
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.1100022 0.02309801 0.6754994 0.04153693 0.1241471 0.02571636
## -1 0.3290833 0.21464994 0.2370499 0.21921681 0.0000000 0.00000000
## +1 0.1858812 0.15440965 0.4967238 0.16298544 0.0000000 0.00000000
Note: since the standard Alexandrov-type signatures regard nucleotide triplets and do not take transcription direction into account, the obtained Shiraishi-type signatures will also be limited to triplets without information about transcription direction.
Important: Please be aware that signatures are initially not determined in isolation but as a set of signatures derived from a commonly large set of tumor genomes (Alexandrov et al. Cell Rep. 3:246–259, 2013; Alexandrov et al. Nature 500:415–421, 2013; Shiraishi et al. PLoS Genet. 11:e1005657, 2015). Therefore, the biological meaning of converting signatures is not well defined, and the approach should be taken with caution! As an example for the possible outcome, please see our paper (Krüger and Piro, 2019).
The exact numeric features of mutational signatures (e.g., the 96 mutation probabilities of trinucleotide mutation types within an Alexandrov-type signature) depend not only on the mutational processes themselves, but also on the nucleotide frequencies within the reference sequences of the mutation data from which the signatures were derived.
Consider, for example, a hypothetical biochemical mutational process which can potentially cause C[C>T]G and A[C>T]A mutations with equal probability. Even if both mutation types are potentially equiprobable, in CG-rich regions more C[C>T]G than A[C>T]A mutations will be observed, simply because these regions likely contain more CCG trinucleotides than ACA trinucleotides which can be mutated. This illustrates that the observed fractions of mutations do not only depend on the underlying mutational process alone, but also on the nucleotide frequencies within the reference sequences from which the mutational signatures were inferred in the first place (e.g., whole genome, whole exome, another subset of genomic regions?).
Sometimes, however, it may be useful to apply signature refitting to mutation data from genomic regions whose nucelotide frequencies differ from those for which mutational signatures are available. For example, signature refitting might need to be applied to mutation data from whole exome sequencing, or targetted sequencing (only a specific set of genes) although mutational signatures were derived for data from the whole reference genome.
For this purpose, the function adjustSignaturesForRegionSet() provides the possibility to (re-)adjust/normalize mutational signatures for a given reference sequence (e.g., whole genome) to the nucleotide frequencies present in another target set of genomic regions (e.g., whole exome) to which signature refitting has to be applied.
Adjustment/normalization of Alexandrov-type signatures:
For Alexandrov-type signatures, the important frequencies are those of multi-nucleotide sequence patterns (usually trinucleotides) whose central base can be mutated. Therefore, adjustment factors for individual multi-nucleotide mutation types (e.g., A[C>T]G) are computed by comparing the corresponding multi-nucleotide frequencies (e.g., ACG) between the original reference regions and the target regions.
Mathematical approach: divide the mutation probabilities of individual mutation features (e.g., trinucleotide mutation types for Alexandrov signatures) by the original (genome-wide?) multi-nucleotide frequency, then multiply it by the corresponding frequency in the target region set. That is, if the target regions, for example, have a doubled frequency of a certain trinucleotide (e.g., ACG) with respect to the original reference sequence, the mutation probabilities of the corresponding mutation types (A[C>A]G, A[C>G]G, A[C>T]G) in an Alexandrov signature will be multiplied by 2. Reasoning: the mutational process has more opportunities to mutate that trinucleotide in the target regions than in the original reference sequence for which the signature was initially derived. After the appropriate adjustment of individual features, an Alexandrov signature is re-normalized such that overall probability over all mutation types sum up to 1.
Adjustment/normalization of Shiraishi-type signatures:
In the Shiraishi-type signature model, individual bases of the multi-nucleotide sequence patterns are considered as independent features. Therefore, the adjustment of such signatures should be applied to individual bases while still reflecting the presence of absence (hence, the frequencies) of the full sequence patterns in the genomic regions. Thus, to compute single-nucleotide frequencies for the adjustment, first the frequencies of the multi-nucleotide sequence patterns are determined (as for Alexandrov-type signatures) and then broken down to single-nucleotide frequencies for the individual positions of the patterns. That is, for each position in the sequence patterns a set of single-nucleotide frequencies is determined to describe, for example, how frequent a specific nucleotide occurs at a specific position in the pattern (very much like in position-frequency matrices used for transcription factor binding motives).
Mathematical approach: for signature adjustment, the position-specific nucleotide frequencies are applied to the position-specific mutation probabilities of a Shiraishi signature much in the same way as described above for the adjustment of multi-nucleotide frequencies in Alexandrov signatures. After the appropriate adjustment of individual position-specific features, a Shirishi signature is re-normalized such that for each individual position in the pattern the overall base change probability (for the central base) or nucleotide probabilities (for flanking bases) sum up to 1.
Usage example:
Example for adjusting Alexandrov signatures (derived from mutation data from the whole genome) to the nucleotide frequencies present in promoter regions:
# get Alexandrov signatures from COSMIC
signatures <- readAlexandrovSignatures()
signatures$Signature.1[1:5]
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A
## 0.011098330 0.009149341 0.001490070 0.006233885 0.006595870
# get gene annotation for the default reference genome (hg19)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
# get a GRanges object for gene promoters (-2000 to +200 bases from TSS)
library(GenomicRanges)
regionsTarget <- promoters(txdb, upstream=2000, downstream=200)
# adjust signatures according to nucleotide frequencies in this subset of
# the genome
sign_adj <-
adjustSignaturesForRegionSet(signatures,
regionsTarget, regionsOriginal=NULL,
refGenome=BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19)
sign_adj$Signature.1[1:5]
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A
## 0.004968736 0.005733981 0.001537169 0.002991879 0.003980106
adjustSignaturesForRegionSet() accepts both types of signatures with the following arguments:
For certain applications it may be necessary to construct signatures or convert them from other kind of data. To do so, each signature must be either a numeric vector of probabilities which sum up to 1 (for Alexandrov-type signatures) or a matrix or data.frame with six numeric columns, every row of which sums up to 1 (for Shiraishi-type signatures).
A set of signatures is then simply a list of such signatures.
To verify whether a signature set has a format that can be used with decompTumor2Sig the package provides the following set of functions. Since the mutation frequencies in genomes are specified in exactly the same way, these functions can be used for both signatures and genomes:
Examples:
isAlexandrovSet(sign_a)
## [1] TRUE
isSignatureSet(sign_a)
## [1] TRUE
isShiraishiSet(sign_s)
## [1] TRUE
isSignatureSet(sign_s)
## [1] TRUE
sameSignatureFormat(sign_a, sign_s)
## [1] FALSE
Information on the somatic mutations found in a tumor can be read from one of the following formats and converted to mutation frequencies for decompTumor2Sig.
The somatic mutations of a tumor genome can be read from a VCF file. For detailed information on this format (including an example), see https://samtools.github.io/hts-specs/VCFv4.2.pdf.
Mutations from multiple tumor genomes can also be read from a multi-sample VCF file.
As an example, the decompTumor2Sig package provides the somatic mutations for a subset of six of the 21 breast cancer genomes originally published by Nik-Zainal et al (Cell 149:979–993, 2012). The dataset has been converted from MPF (see below) to VCF.
Example workflow:
# load the reference genome and the transcript annotation database
refGenome <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19
transcriptAnno <-
TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
# take the six breast cancer genomes from Nik-Zainal et al (PMID: 22608084)
gfile <- system.file("extdata",
"Nik-Zainal_PMID_22608084-VCF-convertedfromMPF.vcf.gz",
package="decompTumor2Sig")
# read the cancer genomes in VCF format
genomes <- readGenomesFromVCF(gfile, numBases=5, type="Shiraishi",
trDir=TRUE, refGenome=refGenome,
transcriptAnno=transcriptAnno, verbose=FALSE)
length(genomes)
## [1] 6
genomes[1:2]
## $PD3851a
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.1902098 0.1048951 0.3482517 0.1160839 0.1566434 0.08391608
## -2 0.2573427 0.2307692 0.1860140 0.3258741 0.0000000 0.00000000
## -1 0.2867133 0.2405594 0.1762238 0.2965035 0.0000000 0.00000000
## +1 0.2839161 0.1902098 0.2685315 0.2573427 0.0000000 0.00000000
## +2 0.2741259 0.2167832 0.1972028 0.3118881 0.0000000 0.00000000
## tr 0.4699301 0.5300699 0.0000000 0.0000000 0.0000000 0.00000000
##
## $PD3890a
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.1545082 0.2471311 0.2323770 0.1176230 0.1315574 0.1168033
## -2 0.3016393 0.2118852 0.1918033 0.2946721 0.0000000 0.0000000
## -1 0.2372951 0.2311475 0.1622951 0.3692623 0.0000000 0.0000000
## +1 0.2954918 0.2086066 0.1545082 0.3413934 0.0000000 0.0000000
## +2 0.2913934 0.1979508 0.1885246 0.3221311 0.0000000 0.0000000
## tr 0.4852459 0.5147541 0.0000000 0.0000000 0.0000000 0.0000000
When reading somatic mutations of tumor genomes with readGenomesFromVCF(), they are preprocessed to determine mutation frequencies according to specific sequence characteristics which can be controlled by the following arguments:
Alternatively, somatic mutations can be read from an MPF file.
Example MPF file:
patient1 chr1 809687 G C
patient1 chr1 819245 G T
patient1 chr2 2818266 A G
patient1 chr2 3433314 G A
patient2 chr1 2927666 A G
patient2 chr1 3359791 C T
The five columns contain
Hence, with an MPF file, too, multiple tumors can be described.
As an example, the decompTumor2Sig package provides the somatic mutations for six of the 21 breast cancer genomes originally published by Nik-Zainal et al (Cell 149:979–993, 2012).
Example workflow:
# load the reference genome and the transcript annotation database
refGenome <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19
transcriptAnno <-
TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
# take the six breast cancer genomes from Nik-Zainal et al (PMID: 22608084)
gfile <- system.file("extdata", "Nik-Zainal_PMID_22608084-MPF.txt.gz",
package="decompTumor2Sig")
# read the cancer genomes in MPF format
genomes <- readGenomesFromMPF(gfile, numBases=5, type="Shiraishi",
trDir=TRUE, refGenome=refGenome,
transcriptAnno=transcriptAnno, verbose=FALSE)
Note: the preprocessing of the somatic mutations into mutation frequencies can be controlled with the same function arguments that have been described above for readGenomesFromVCF().
The third possibility to get the somatic mutations from one or more tumor genomes is to convert them directly from a tumor data object (MutationFeatureData object) loaded using the pmsignature package (Shiraishi et al. PLoS Genet. 11:e1005657, 2015).
An example of such an object is provided with decompTumor2Sig:
# get breast cancer genomes from Nik-Zainal et al (PMID: 22608084)
# in the format produced by pmsignature (PMID: 26630308)
pmsigdata <- system.file("extdata",
"Nik-Zainal_PMID_22608084-pmsignature-G.Rdata",
package="decompTumor2Sig")
load(pmsigdata)
# extract the genomes from the pmsignature 'G' object
genomes <- getGenomesFromMutFeatData(G, normalize=TRUE)
genomes[1]
## $PD3851a
## [C>A] [C>G] [C>T] [T>A] [T>C] [T>G]
## mut 0.1913161 0.1031208 0.3487110 0.1166893 0.156038 0.08412483
## -2 0.2605156 0.2306649 0.1831750 0.3256445 0.000000 0.00000000
## -1 0.2862958 0.2388060 0.1791045 0.2957938 0.000000 0.00000000
## +1 0.2795115 0.1886024 0.2686567 0.2632293 0.000000 0.00000000
## +2 0.2727273 0.2184532 0.1981004 0.3107191 0.000000 0.00000000
## tr 0.4735414 0.5264586 0.0000000 0.0000000 0.000000 0.00000000
Please note that pmsignature is neither part of CRAN nor of Bioconductor and must be downloaded and installed manually (available at: https://github.com/friend1ws/pmsignature). To read tumor genomes without pmsignature being installed, see the previous sections about VCF and MPF and the following section.
Important: the argument normalize, that can be specified for getGenomesFromMutFeatData(), controls whether the function should simply count the number of occurrences or whether it provides (normalized) fractions/percentages of mutations among the total set. Normalization is the default and is what should be used for determining the contribution of individual signatures to the mutational load of a tumor. normalize=FALSE should be used only in case you are interested in how many somatic mutations of the single signature categories can be found in a tumor; it should not be used for further processing with decompTumor2Sig!
Important: There is a slight difference on how pmsignature and decompTumor2Sig preprocess mutations for counting them when taking the transcription direction into account: For mutations which map to a region with multiple overlapping genes with opposing transcription directions, pmsignature assigns the transcript direction of the gene encountered first in the transcript database (see also Section 3.2.1). This was also the behavior of decompTumor2Sig until version 1.3.5 (used for our papers; Krüger and Piro, 2017, 2019). In newer versions, decompTumor2Sig excludes these mutations by default from the count because from mutation data alone it cannot be inferred which of the two genes has the higher transcriptional activity which might potentially be linked to the occurrence of the mutation. However, when converting data from pmsignature these mutations have already been processed and can therefore not be excluded during the conversion.
The Bioconductor package VariantAnnotation provides the class VRanges which can be used to store mutation information. decompTumor2Sig allows to extract single nucleotide variants (SNVs) from such an object and convert them into the tumor genomes’ mutation frequencies using the function convertGenomesFromVRanges(), as in the following example workflow:
# load the reference genome and the transcript annotation database
refGenome <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19
transcriptAnno <-
TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
# take six breast cancer genomes from Nik-Zainal et al (PMID: 22608084)
gfile <- system.file("extdata",
"Nik-Zainal_PMID_22608084-VCF-convertedfromMPF.vcf.gz",
package="decompTumor2Sig")
# get the corresponding VRanges object (using the VariantAnnotation package)
library(VariantAnnotation)
vr <- readVcfAsVRanges(gfile, genome="hg19")
# convert the VRanges object to the decompTumor2Sig format
genomes <- convertGenomesFromVRanges(vr, numBases=5, type="Shiraishi",
trDir=TRUE, refGenome=refGenome,
transcriptAnno=transcriptAnno,
verbose=FALSE)
Note: the preprocessing of the somatic mutations into mutation frequencies can be controlled with the same function arguments that have been described above for readGenomesFromVCF().
Since the mutation data of genomes is specified as mutation probabilities and has exactly the same format as signatures, the format can be verified with the very same functions described in Section 3.1.6 (see above).
(Note: The following examples are for illustrative purpose only and need not be biologically meaningful.)
Once both the tumor genome(s) and the given mutational signatures have been read (see above), the contribution of the given signatures to the somatic mutations in individual tumors can be determined using the following workflow.
In the following, we assume to have the signatures in a list object named signatures and the mutation frequencies of the tumor genome(s) in another list object named genomes.
Important note: it is imperative that the mutation frequencies represented by both signatures and genomes have been computed with the same characteristics. That is, if the signatures refer to sequences of 5 nucleotides/bases (with the mutated base at the center), then also the genomes must have been read for 5 bases. If the signatures have been produced taking the transcription direction into account, then also the genomes must have been read taking the transcription direction into account, and so on.
Given that the signatures and the genomes (if read appropriately) have the same format and contain the same type of information (fractions/percentages of somatic mutations that have specific features, e.g., specific neighboring bases), both can essentially be visualized in the same way.
The function plotMutationDistribution() takes as an input either a single signature or the mutation frequencies of an individual tumor genome (either of Alexandrov- or of Shiraishi-type) and plots the mutation frequency data according to the signature model.
To plot, for example, Alexandrov/COSMIC signature 1 (obtained as described in Section 3.1.1):
signatures <- readAlexandrovSignatures()
plotMutationDistribution(signatures[[1]])
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the decompTumor2Sig package.
## Please report the issue at
## <https://github.com/rmpiro/decompTumor2Sig/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
To plot one of the Shiraishi signatures provided with this package (see Section 3.1.2):
sigfiles <- system.file("extdata",
paste0("Nik-Zainal_PMID_22608084-pmsignature-sig",1:4,".tsv"),
package="decompTumor2Sig")
signatures <- readShiraishiSignatures(files=sigfiles)
plotMutationDistribution(signatures[[1]])
decompTumor2Sig’s representation of the mutation frequency data of Shiraishi-type signatures uses sequence logos for the flanking bases and the variant bases (with the heights of the bases being proportional to their probability/frequency). The two possibilities for the mutated central base (C or T) are represented next to each other and their respective frequency is indicated below. This side-by-side representation allows to distinguish the probabilities of variant bases (on top) according to the mutated base. Transcription strand bias (if information on transcription direction is used) is shown in the upper right corner (frequency of mutations on the transcription strand, “+”, and the opposite strand, “-”).
In the plot above, for example, C and T are nearly equally likely to be mutated by the represented mutational process, but a mutated C most frequently becomes a T, while a mutated T becomes one of the other bases with roughly equal probability. Also, the mutational signature has next to no strand bias.
This representation is similar to the way the pmsignature package represents such signatures, as shown by the following example:
(This plot above was generated with pmsignature and serves only for comparison, showing the same signature as above.)
To show that genome mutation frequencies can be represented in the same manner, the following example reads the tumor genomes provided with this package (see Section 3.2.1) using the Alexandrov model and plots the mutation frequencies of the first genome:
refGenome <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19
gfile <- system.file("extdata",
"Nik-Zainal_PMID_22608084-VCF-convertedfromMPF.vcf.gz",
package="decompTumor2Sig")
genomes <- readGenomesFromVCF(gfile, numBases=3, type="Alexandrov",
trDir=FALSE, refGenome=refGenome, verbose=FALSE)
plotMutationDistribution(genomes[[1]])
In many cases already a small subset of the given signatures is sufficient to explain the major part of the variance of the mutation frequencies observed in a single tumor genome.
The explained variance can be determined by comparing the true mutation frequencies \(g_i\) of the tumor genome—where \(i\) is one of the mutation features of the signature model (e.g., triplet mutations for Alexandrov signatures, or base changes or flanking bases for Shiraishi signatures)—to the mutation frequencies \(\hat{g}_i\) obtained when re-composing/reconstructing the mutation frequencies of the tumor genome from the mutational signatures and their computed exposures/contributions. (See Section 4.3 for computing the exposures/contributions, and Section 4.4 for reconstructing the mutation frequencies of a tumor.)
For the Alexandrov model, the explained variance can be estimated as:
\[ \mathrm{evar}=1-\frac{\sum_i{(g_i-\hat{g}_i)^2}}{\sum_i({g_i}-\bar{g})^2} \]
where the numerator is the residual sum of squares (RSS) between the predicted and true mutation frequencies of the tumor genome (i.e., the squared error), and the denominator can be interpreted as the deviation from a tumor genome with a uniform mutation frequency of 1/96 for each feature (which is precisely the average mutation frequency \(\bar{g}\)).
For the Shiraishi model, \(\bar{g}\) does not describe a tumor genome with uniform mutation frequencies, and hence the explained variance is estimated as:
\[\mathrm{evar}=1-\frac{\sum_i{(g_i-\hat{g}_i)^2}}{\sum_i({g_i}-{g}^{*}_i)^2}\]
where \(g^{*}\) is a uniform tumor model that uses mutation frequencies of 1/6 for the six possible base changes, 1/4 for each of the possible flanking bases, and 1/2 for each of the two possible transcription-strand directions. For more details, please see Krüger and Piro (2019).
The function plotExplainedVariance() allows to visually analyze how many signatures are necessary to explain certain fractions of the variance of a tumor genome’s mutational patterns.
For an increasing number K of signatures, the highest variance explained by subsets of K signatures will be plotted. This can help to evaluate what minimum threshold for the explained variance could be used to decompose tumor genomes with the function decomposeTumorGenomes() (see below).
As a simple example, load a set of 15 Shiraishi-type signatures (object signatures) provided with this package. These signatures were obtained with the package pmsignature from a set of 435 tumor genomes with at least 100 somatic mutations from ten different tumor entities (data from Alexandrov et al. Nature 500:415-421, 2013; for the analysis, see our papers mentioned in Section 1.1):
# load the 15 Shiraishi signatures obtained from
# 435 tumor genomes from Alexandrov et al.
sfile <- system.file("extdata",
"Alexandrov_PMID_23945592_435_tumors-pmsignature-15sig.Rdata",
package="decompTumor2Sig")
load(sfile)
length(signatures)
## [1] 15
signatures[1]
## [[1]]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 6.431522e-13 2.320704e-02 2.293643e-79 0.64050260 0.1876257 0.1486647
## [2,] 2.916982e-01 2.059223e-01 2.043698e-01 0.29800971 0.0000000 0.0000000
## [3,] 2.239856e-01 4.126288e-01 1.982799e-01 0.16510571 0.0000000 0.0000000
## [4,] 9.822384e-02 7.036368e-14 8.147461e-01 0.08703009 0.0000000 0.0000000
## [5,] 2.524624e-01 1.965627e-01 2.681366e-01 0.28283839 0.0000000 0.0000000
## [6,] 4.422814e-01 5.577186e-01 0.000000e+00 0.00000000 0.0000000 0.0000000
This loads an object signatures with 15 Shiraishi signatures for mutated subsequences of length 5 (five nucleotides with the mutated base at the center) and taking transcription direction into account.
Now read the tumor genomes (object genomes) provided with this package, as described in Section 3.2.1:
# load the reference genome and the transcript annotation database
refGenome <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19
transcriptAnno <-
TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
# take six breast cancer genomes from Nik-Zainal et al (PMID: 22608084)
gfile <- system.file("extdata",
"Nik-Zainal_PMID_22608084-VCF-convertedfromMPF.vcf.gz",
package="decompTumor2Sig")
# read the cancer genomes in VCF format
genomes <- readGenomesFromVCF(gfile, numBases=5, type="Shiraishi",
trDir=TRUE, refGenome=refGenome,
transcriptAnno=transcriptAnno, verbose=FALSE)
The explained variance can be plotted only for a single tumor genome. Plotting the explained variance of the first tumor genome when using increasing subsets of the 15 mutational signatures (see above), for example, can be done with the following command:
plotExplainedVariance(genomes[[1]], signatures, minExplainedVariance=0.9,
minNumSignatures=2, maxNumSignatures=NULL,
greedySearch=TRUE)
The function plotExplainedVariance() takes the following arguments:
This is the heart of the functionality of decompTumor2Sig, its main purpose.
Given a tumor genome and a set of mutational signatures (that represent mutational processes like UV light, smoking, etc; see Alexandrov and Stratton, Curr Opin Genet Dev 24:52-60, 2014), we would like to estimate how strongly the different signatures (processes) contributed to the overall mutational load observed in the tumor. To this end decompTumor2Sig relies on quadratic programming. For details, see our paper (Krüger and Piro, 2019).
The result will be a vector of weights/contributions (or “exposures”) which indicate the fractions or percentages of somatic mutations which can likely be attributed to the given signatures.
Lets take, for instance, the signature and tumor data used for the example in Section 4.2 (see there).
To compute the contributions of the 15 given signatures to the first tumor genome, we can run the following command …
exposure <- decomposeTumorGenomes(genomes[1], signatures)
… and get the following exposures (contributions) for the 15 signatures:
exposure
## $PD3851a
## sign_1 sign_2 sign_3 sign_4 sign_5 sign_6
## 0.048859110 0.132178330 0.050850027 0.054031520 0.001450941 0.109720556
## sign_7 sign_8 sign_9 sign_10 sign_11 sign_12
## 0.095453994 0.049931644 0.030747503 0.091294629 0.025005691 0.032603285
## sign_13 sign_14 sign_15
## 0.037525747 0.194417963 0.045929061
The exposures/contributions for a single tumor genome can also be plotted:
plotDecomposedContribution(exposure)