load example data
To demonstrate the use of ZygosityPredictor, NGS data from the Seq-C2 project
were used [1]. In the following chunk, all required datalayer of the WGS_LL_T_1
sample are loaded. The variants are loaded as GRanges objects, one for somatic
copy number alterations (GR_SCNA), one for germline- and one for somatic small
variants (GR_GERM_SMALL_VARS and GR_SOM_SMALL_VARS). The input formats will be
discussed in more detail in section 4.
library(ZygosityPredictor)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: GenomeInfoDb
# file to sequence alignment
FILE_BAM <- system.file("extdata", "ZP_example.bam",
package = "ZygosityPredictor")
VCF <- system.file("extdata", "ZP_example_chr7.vcf.gz",
package = "ZygosityPredictor")
# meta information of the sample
PURITY <- 0.98
PLOIDY <- 1.57
SEX <- "female"
# variants
data("GR_SCNA")
data("GR_GERM_SMALL_VARS")
data("GR_SOM_SMALL_VARS")
# used gene model
data("GR_GENE_MODEL")
data("GR_HAPLOBLOCKS")
Calculation of affected copies of a variant
Two functions are provided to calculate how many copies are affected by single
small variants, based on two formulas, one for germline variants and one for
somatic variants.
Germline variants
To calculate the affected copies for a germline variant by using
aff_germ_copies()
, the following inputs are required:
- af: numeric; between 0 and 1; calculated allele frequency of the variant
in the tumor sample
- tcn: numeric; total copy number at the position of the variant
- purity: numeric; between 0 and 1; purity or tumor cell content of the
tumor sample
- c_normal: numeric; expected copy number at the position of the variant
in normal tissue, 1 for gonosomes in male samples, and 2 for male autosomes and
all chromosomes in female samples. (The function can also assess the c_normal
parameter by itself, but then the following two inputs must be provided:
chr and sex)
- chr: (only if c_normal is not provided) character; can be either a single
number or in the “chr1” format; chromosome of the variant
- sex: (only if c_normal is not provided) character; either “male” or
“female” / “m” or “f”; sex of the sample
- af_normal: (default 0.5) numeric; allele-frequency of germline variant in
normal tissue. 0.5 represents heterozygous variants in diploid genome, 1 would
be homozygous. Could be relevant if germline CNVs are present at the position.
Then also the c_normal parameter would have to be adjusted.
the output is a numeric value that indicates the affected copies.
## as an example we take the first variant of our prepared input data and
## extract the required information from different input data layer
## the allele frequency and the chromosome can be taken from the GRanges object
AF = elementMetadata(GR_GERM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_GERM_SMALL_VARS[1]) %>%
as.character()
## the total copy number (tcn) can be extracted from the CNV object by selecting
## the CNV from the position of the variant
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_GERM_SMALL_VARS[1])
)[["tcn"]]
## purity and sex can be taken from the global variables of the sample
## with this function call the affected copies are calculated for the variant
aff_germ_copies(af=AF,
tcn=TCN,
purity=PURITY,
chr=CHR,
sex=SEX)
## [1] 1.50733
Somatic variants
To calculate how many copies are affected by a somatic variant by
aff_som_copies()
, the same inputs are required, but a different formula
is evaluated:
## the function for somatic variants works the same way as the germline function
AF = elementMetadata(GR_SOM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_SOM_SMALL_VARS[1]) %>%
as.character()
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_SOM_SMALL_VARS[1])
)[["tcn"]]
aff_som_copies(af=AF,
chr=CHR,
tcn=TCN,
purity=PURITY,
sex=SEX)
## [1] 1.471406
Calculate affected copies of a set of variants
In order to apply the previously mentioned functions to a whole set of variants
and calculate the affected copies, the following code can be used.
## as an example we calculate the affected copies for the somatic variants:
GR_SOM_SMALL_VARS %>%
## cnv information for every variant is required.. merge with cnv object
IRanges::mergeByOverlaps(GR_SCNA) %>%
as_tibble() %>%
## select relevant columns
select(chr=1, pos=2, gene, af, tcn) %>%
mutate_at(.vars=c("tcn", "af"), .funs=as.numeric) %>%
rowwise() %>%
mutate(
aff_copies = aff_som_copies(chr, af, tcn, PURITY, SEX),
wt_copies = tcn-aff_copies
)
## # A tibble: 10 × 7
## # Rowwise:
## chr pos gene af tcn aff_copies wt_copies
## <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chr6 29100982 OR2J1 0.358 4.06 1.47 2.59
## 2 chr6 29101487 OR2J1 0.328 4.06 1.35 2.72
## 3 chr6 29101522 OR2J1 0.328 4.06 1.35 2.72
## 4 chr7 107767595 SLC26A3 0.0263 2.49 0.0665 2.42
## 5 chr7 107774037 SLC26A3 0.0476 2.49 0.120 2.37
## 6 chr16 48216115 ABCC11 0.352 3.07 1.10 1.97
## 7 chr16 48231866 ABCC11 0.347 3.07 1.08 1.99
## 8 chrX 88753422 CPXCR1 0.5 1.12 0.579 0.539
## 9 chrX 88753806 CPXCR1 0.491 1.12 0.569 0.549
## 10 chr17 41771694 JUP 0.857 1.06 0.947 0.117
There s also the function predict_per_variant
that basically does the same
with slightly adjusted inputs.
predict_per_variant(purity=PURITY,
sex=SEX,
somCna=GR_SCNA,
somSmallVars=GR_SOM_SMALL_VARS)
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in check_gr_small_vars(germSmallVars, "germline", ZP_env): Input
## germSmallVars empty/does not contain variants. Assuming there are no germline
## small variants
## Warning in check_opt_incdel(includeIncompleteDel, ploidy): Large scale
## deletions cannot be included without ploidyPlease provide input ploidy. Provide
## ploidy=2to assume diploid case
## Warning in check_opt_incdel(includeHomoDel, ploidy): Large scale deletions
## cannot be included without ploidyPlease provide input ploidy. Provide
## ploidy=2to assume diploid case
## $evaluation_per_variant
## # A tibble: 10 × 19
## gene mut_id pos ref alt af tcn cna_type all_imb gt_cna seg_id
## <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <chr> <lgl> <chr> <int>
## 1 OR2J1 m1 2.91e7 C A 0.358 4.06 HZ FALSE <NA> 5
## 2 OR2J1 m2 2.91e7 T C 0.328 4.06 HZ FALSE <NA> 5
## 3 OR2J1 m3 2.91e7 C T 0.328 4.06 HZ FALSE <NA> 5
## 4 SLC26A3 m1 1.08e8 C T 0.0263 2.49 LOH FALSE 2:sub 7
## 5 SLC26A3 m2 1.08e8 G C 0.0476 2.49 LOH FALSE 2:sub 7
## 6 ABCC11 m1 4.82e7 G T 0.352 3.07 HZ TRUE 1:2 1
## 7 ABCC11 m2 4.82e7 C T 0.347 3.07 HZ TRUE 1:2 1
## 8 CPXCR1 m1 8.88e7 A C 0.5 1.12 HZ FALSE <NA> 8
## 9 CPXCR1 m2 8.88e7 G A 0.491 1.12 HZ FALSE <NA> 8
## 10 JUP m1 4.18e7 GTGT… G 0.857 1.06 LOH FALSE 1:0 2
## # ℹ 8 more variables: tcn_assumed <lgl>, origin <chr>, chr <fct>, class <chr>,
## # aff_cp <dbl>, wt_cp <dbl>, vn_status <dbl>, pre_info <chr>
##
## $combined_uncovered
## NULL
Predict Zygosity
In this section, we will use the WGS_LL_T_1 dataset from the Seq-C2 project as
an example to investigate whether mutations in the following genes result in
total absence of wildtype copies. The genes which were selected as an example
for the analysis are shown below. The example data set was reduced to these
genes.
- TP53
- BRCA1
- TRANK1
- TRIM3
- JUP
- CDYL
- SCRIB
- ELP2
Predict zygosity for a set of genes in a sample
The prediction of zygosity for a set of genes in a sample can be assessed by
the predict_zygosity()
function.
Important note: The runtime of the analysis depends strongly on the number
of genes to be assessed and on the number of input variants. It is therefore
recommended to reduce the number of genes to the necessary ones. Also,
depending on the research question to be addressed, the variants should be
filtered to the most relevant ones, not only because of runtime considerations,
but also to sharpen the final result. A large number of mutations in a gene,
some of which are of little biological relevance or even SNPs, will inevitably
reduce the validity of the results.
fp <- predict_zygosity(
purity = PURITY,
ploidy = PLOIDY,
sex = SEX,
somCna = GR_SCNA,
somSmallVars = GR_SOM_SMALL_VARS,
germSmallVars = GR_GERM_SMALL_VARS,
geneModel = GR_GENE_MODEL,
bamDna = FILE_BAM,
vcf=VCF,
haploBlocks = GR_HAPLOBLOCKS
)
## Warning in check_gr_gene_model(geneModel, ZP_env):
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## no RNA file provided: Analysis will be done without RNA reads
Interpretation of results
Of note, the results displayed here were chosen to explain and exemplify the
functionality of the tool; biological and medical impact of the specific
variants has not been a selection criterion.
The result which is returned by the function consists of a list of tibbles:
- Evaluation per variant
- Evaluation per gene
- Main Phasing info
- detailed read-level phasing (RLP) info
- detailed alleleic imbalance phasing (AIP) info, if enabled
- Read pair info (only if showReadDetail=TRUE and logDir is provided)
- Variants not covered by somCna (only if present and no sCNA gap assumption
was done)
One way of accessing the results is a simple extraction of the tibbles from the
list. In addition, two accessor function are implemented: ZP_ov()
shows an
overview of the full resultand and gene_ov()
shows more detailed
information about a selected gene.
ZP_ov(fp)
## Zygosity Prediction of 12 input variants
## 0 are uncovered by sCNA input and were not evaluated
## 7 genes analyzed:
## |status |eval_by | n|
## |:-------------------|:--------------|--:|
## |all_copies_affected |aff_cp | 2|
## |wt_copies_left |aff_cp | 2|
## |wt_copies_left |direct-phasing | 1|
## |wt_copies_left |insufficient | 2|
## |undefined |NA | 0|
## phased genes:
## CPXCR1, OR2J1
The function provides an overview about the evaluations done by
ZygosityPredictor. One can see that in our case 12 input small variants were
used to predict gene status. Of those, two got the status all_copies_affected.
Evaluation per variant
The first result of the function is the evaluation per variant. In this step
all information required for subsequent steps is annotated and the affected
copies per variant are calculated. For every variant, the function checks
whether it already affects all copies of the gene. The format of the output is
a tibble; the number of rows corresponds to the total
number of input variants. The tool annotates a few self-explanatory columns
such as the origin of the respective variant (germline/somatic) or the class
(snv/ins/del). It also appends information from the sCNA results: the total
copy number at the position of the variant and the information if a loss of
heterozygosity is present (cna_type). Also, an ID is assigned to every small
variant. Then, the genes are numbered consecutively in order to unambiguously
assign variants to genes in the following analysis. The most important results
of this step are the calculation of the affected and wildtype copies, as well
as, depending on the data, an initial check of whether a variant already affects
all copies.
Of note, there can be situations in which left wildtype copies are below 0.5,
but still this information is not sufficient to predict “all_copies_affected”
without doubt. Depending on the origin of the variant, further criteria must
be met (e.g., LOH). The procedure for this first check is shown in the pre_info
column.
Evaluation per gene + phasing info
By using gene_ov()
, more detailed information can be viewed about how the
tool came to a gene status. What the accessor function does is basically
filtering the output tibbles to the gene of relevance. Detailed explanation about all columns can be found below the next chunk.
gene_ov(fp, OR2J1)
## Top level: Gene status
## |gene | n_mut|status | conf|eval_by | wt_cp|warning |wt_cp_range |info |phasing | eval_time_s|
## |:-----|-----:|:--------------|----:|:-------|--------:|:-------|:-----------|:----------------------------------------|:--------------|-----------:|
## |OR2J1 | 3|wt_copies_left | 1|aff_cp | 2.592225|none |1.24 - 2.72 |most relevant phasing combination: m1-m2 |direct-phasing | 3.283|
##
##
## Sub level: Evaluation per variant
## |gene |origin |class |chr | pos|ref |alt | af| tcn|tcn_assumed |cna_type |all_imb |gt_cna | seg_id| aff_cp| wt_cp|mut_id |
## |:-----|:-------|:-----|:----|--------:|:---|:---|----:|----:|:-----------|:--------|:-------|:------|------:|------:|-----:|:------|
## |OR2J1 |somatic |snv |chr6 | 29100982|C |A | 0.36| 4.06|FALSE |HZ |FALSE |NA | 5| 1.47| 2.59|m1 |
## |OR2J1 |somatic |snv |chr6 | 29101487|T |C | 0.33| 4.06|FALSE |HZ |FALSE |NA | 5| 1.35| 2.72|m2 |
## |OR2J1 |somatic |snv |chr6 | 29101522|C |T | 0.33| 4.06|FALSE |HZ |FALSE |NA | 5| 1.35| 2.72|m3 |
##
##
## Sub level: Main phasing combinations
## |comb | nconst|const |phasing |via | conf|unplausible |subclonal | wt_cp| min_poss_wt_cp| max_poss_wt_cp| score|gene |
## |:-----|------:|:-----|:--------------|:---|----:|:-----------|:---------|-----:|--------------:|--------------:|-----:|:-----|
## |m2-m3 | 1|same |direct-phasing |8 | 1.00|0 |0 | 2.72| 1.37| 2.72| 1|OR2J1 |
## |m1-m2 | 1|same |direct-phasing |4 | 0.86|0 |0 | 2.59| 1.25| 2.59| 1|OR2J1 |
## |m1-m3 | 1|same |direct-phasing |7 | 0.86|0 |0 | 2.59| 1.24| 2.59| 1|OR2J1 |
##
##
## Sub level: All read-level-phasing combinations, including SNPs
## Showing 3 of 3 phasing attempts
##
## | both| mut1| mut2| dev_var| skipped|const | nconst| p_same| p_diff| none_raw| DNA_rds| RNA_rds|subclonal |unplausible | dist|class_comb |comb | ncomb|gene | conf|
## |-----:|----:|----:|-------:|-------:|:-----|------:|------:|------:|--------:|-------:|-------:|:---------|:-----------|----:|:----------|:-----|-----:|:-----|----:|
## | 2.00| 0| 0| 0| 0|same | 1| 1| 0.14| 0| 2| 0|FALSE |FALSE | 505|snv-snv |m1-m2 | 4|OR2J1 | NA|
## | 2.00| 0| 0| 0| 0|same | 1| 1| 0.14| 0| 2| 0|FALSE |FALSE | 540|snv-snv |m1-m3 | 7|OR2J1 | NA|
## | 39.99| 0| 0| 0| 0|same | 1| 1| 0.00| 22| 62| 0|FALSE |FALSE | 35|snv-snv |m2-m3 | 8|OR2J1 | NA|
##
The first prompted tibble originates from the eval_per_gene tibble of the output.
The tibble shown
below originates from eval_per_variant and shows all input small variants of the
gene. The next tibble originates from main_phasing_info and shows the main
phasing combinations. Finally, the last tibble comes from detailed_RLP_info
and contains every phasing combination including the ones with and between SNPs.
What can be seen is that the final gene status of OR2J1 is wt_copies_left. The
tool predicted around 2.6 remaining wt copies with maximum confidence level of
1. Below it can be seen that 3 small variants inside the gene were in the input
leading to 3 main phasing combinations. All of them were phased via direct read level phasing.
As all main combinations could
be solved, the left wt copies could be accurately defined and the gene status
assigned. In this particular case, the gene status could have been solved even
without phasing all main combinations by the case we refer to as insufficient.
As the final gene status is defnied via the so called integrated_affected_copies
,
which are defnied for diff constellation via: aff_cp(m1)+aff_cp(m2)
and for the same constellation:
max(aff_cp(m1), aff_cp(m2))
, we can pre-calculate the maximum affected
copies in case of a diff constellation. If the difference of the total copy number
and the integrated affected copies leaves more than 0.5 wt copies, the status wt_cp_left can be concluded.
As can be seen in the main phasing combinations in column: min_poss_wt_cp, none
of the main phasing combinations could lead to less than 0.5 wt copies even if
the variants were found on different reads.
The outputs contain columns providing more detailed information about gene status definition
which will be explained in the following.
All read-level phasing combinations
(either visible via gene_ov()
or accessable via fp$detailed_RLP_info
):
- both: number of reads classified as both (both variants present), adjusted with basecall and maping quality
- mut1: number of reads classified as mut1 (only the first variant of combination pßresent), adjusted with basecall and maping quality
- mut2: number of reads classified as mut2 (only the second variant of combination pßresent), adjusted with basecall and maping quality
- dev_var: number of reads having another variant at one of the positions of the expected variants
- skipped: number of reads not mapping to the position of the variant (only expected with RNA reads)
- nstatus: variant constellation in numeric representation (2 = all copies affected, 1 = wt copies left, 0 = undefined)
- status: variant constellation in character representation
- nstatus: numeric representation of constellation
- xsq_same/xsq_diff: X-squared values of chi-squared tests. _same is the one against expected same result
- p_same/p_diff: p-value of chi-squared tests
- v_same/v_diff: Cramers V value of chi-squared tests (5)
- none: number of reads classified as none (none of both variants found)
- DNA_rds/RNA_rds: number of DNA/RNA reads used
- dist: Distance between variants
- class_comb: mutational classes corresponding to character combination identifier
- comb: combination identifier in character representation (mX-mY).
- ncomb: combination identifier in numeric representation. Derived from position in phasing matrix
- subclonal: TRUE if reads with classification both and either mut1 or mut2 were found. There are two explanantions why soimething like this could happen in a
cancer sample: First that the tumor is not fully clonal and a subclone is present carrying only the variant that happened earlier during tumor development. The second variant is only carried by a subclone that emerged later. The second explanation could be that the reads are actually from the same cells but a duplication of one allel happend between their occurence. We could imagine a scenario were the genotype of a gene is 1:2 and both variants are on the duplicated allele. If one of them happend before the duplication, we would expect reads carrying only the earlier variant but also reads carrying both of them.
- unplausible TRUE if reads with classification both, mut1 and mut2 are found for the same variant combination. Compared to the “subclonal” case this is more difficult to explain and might also be an artifact. The only way this could happen in a cancer sample is that the same mutational event happens independently on both alleles and or in different subclones. This case is generally very unlikely and is therefore annotated as unplasuible. Such a setting is also reflected in the p-valkue and thereofre the confidence, as similary with both expected results is found.
Main phasing combinations (either visible via gene_ov()
or accessable via fp$phasing_info
):
- phasing: Info which phasing approach was used (direct, indirect, haploblock, imbalance, flagged)
- via: combinations which were used to solve this combination. For direct phasing the number is the numeric representation of the combi9nation identifier.
For indirect phasing it will look like this: 4-7-8 which measn thta the combinations 4, 7 and 8 were used to solve.
- conf: aggregated confidence for the combination
- wt_cp: exact left wt copies. If phasing of combination was not succesful, wt copies can not be calculated acurately which leads to NA
- min_poss_wt_cp: Minimal possible wt copies if variants are on different copies according to affected copynumber of the two variants in the combination. If greater than 0.5, the constellation is unable to contributre to a status of all copies affrected for the gene.
- max_poss_wt_cp: maximum possible wt copies if the two variants are on the same copy
- score: contribution to gene status. (2 is a contribution of all copies affected by the combination, 1 has more than 0.5 wt copies left)
- gene: Gene that w2as tried to solve with the combination