This vignette outlines parsing and annotation of structural variants from Variant Call Format (VCF) using the StructuralVariantAnnotation
package. StructuralVariantAnnotation
contains useful helper functions for reading, comparing, and interpreting structural variant calls. StructuralVariantAnnotation
The package contains functions for parsing VCFs from a number of popular callers, dealing with breakpoints involving two separate genomic loci encoded as GRanges
objects, as well as identifying various biological phenomena.
The StructuralVariationAnnotation package can be installed from Bioconductor as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
BiocManager::install("StructuralVariantAnnotation")
library(StructuralVariantAnnotation)
The VCF standard allows for four different structural variant notations. Here, we shall us a simple deletion as an example of the available notations:
CGTGTtgtagtaCCGTAA chr sequence
------- deleted bases
chr 5 sequence TTGTAGTA T . .
This is the format that most users will be familar with. It is the format used for small insertions and deletions. Unfortunately, it does not scale well to large events since both the REF
and ALT
sequences must be specified in full. For this reason, most structural variant callers will report events in a different notation.
chr 5 symbolic T <DEL> . . SVTYPE=DEL;SVLEN=-7;END=12
In symbolic notation, the ALT
is replaced with a special symbolic allele, optionally with a sub-type (e.g. <DEL:ME>
for a deletion known to be a mobile element deletion). The size of the event is determined via the SVLEN
and END
fields. The VCF specifications recognises <DEL>
, <DUP>
, <INS>
, <INV>
, and <CNV>
as top-level symoblic structural variant alleles with all other values taking on non-standard implementation-defined meaning.
There are two problems with this representation. Firstly, it is ambiguous whether a caller reporting an event is making a breakpoint claim, a copy number change, or both. Secondly, this notation does not allow inter-chromosomal events to be represented. These require yet another notation
chr 5 breakpoint1 T T[chr:13[ . . SVTYPE=BND;MATEID=breakpoint2
chr 13 breakpoint2 C ]chr:5]C . . SVTYPE=BND;MATEID=breakpoint1
In this notation, a breakpoint is represented by a pair of breakend records, each containing the location of the two breakends and linked by their ID
using the MATEID
INFO
field. Arbitrary rearrangements can be represented using breakend notation, including special-case syntax to represent the termination of a chromosome.
Finally, the VCF specifications support one additional notation for structural variants: single breakends.
chr 5 breakend T T. . . SVTYPE=BND
Single breakends are breakpoints in which only one side can be unambiguously determined. This can be due to one breakend being unable to be uniquely placed (e.g. a breakpoint into centromeric sequence), or due to the sequence of one side being novel with respect to the reference (e.g. a viral integration breakpoint).
Unlike breakpoint-centric data structures such as the Pairs
object that rtracklayer
uses to load BEDPE files, this package uses a breakend-centric notation. Breakends are stored in a GRanges object with strand used to indicate orientation and is consistent with VCF breakend notation. The position in the GRanges indicates the position of the base immediately adjacent to the breakpoint. +
indicates that the breakpoint orientation is “forward” the derivative chromosome consists of the bases at and before the break position. -
indicates that the breakpoint orientation is “backwards” the derivative chromosome consists of the bases at and after the break position. Breakpoints are represented using the partner
field. The partner
field contains the GRanges row name of the breakend at the other side of the breakpoint. Single breakends have an NA
partner
since the other side is unknown.
This notation has a number of advantages over other repentation formations. Firstly, it allows both breakpoints and single breakends to be represented in a single data structure. Secondly, it simplifies annotation as most genomic annotations are for a genomic position, or a contiguous genomic region (e.g. genes, introns, exons, repeats, mappability). Finally, it provides a unified representation format regardless of the format used to represent the variant.
Using our previous example, we can see that the direct sequence and symbolic representations require multiple rows to represent:
suppressPackageStartupMessages(library(StructuralVariantAnnotation))
vcf <- VariantAnnotation::readVcf(system.file("extdata", "representations.vcf", package = "StructuralVariantAnnotation"))
gr <- c(breakpointRanges(vcf), breakendRanges(vcf))
gr
#> GRanges object with 7 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT
#> <Rle> <IRanges> <Rle> | <factor> <character> <character>
#> sequence_bp1 chr 5 + | NA TTGTAGTA T
#> sequence_bp2 chr 13 - | NA TTGTAGTA T
#> symbolic_bp1 chr 5 + | NA T <DEL>
#> symbolic_bp2 chr 13 - | NA T <DEL>
#> breakpoint1 chr 5 + | NA T T[chr:13[
#> breakpoint2 chr 13 - | NA C ]chr:5]C
#> breakend chr 5 + | NA T T.
#> QUAL FILTER sourceId partner svtype
#> <numeric> <character> <character> <character> <character>
#> sequence_bp1 NA . sequence sequence_bp2 <NA>
#> sequence_bp2 NA . sequence sequence_bp1 <NA>
#> symbolic_bp1 NA . symbolic symbolic_bp2 DEL
#> symbolic_bp2 NA . symbolic symbolic_bp1 DEL
#> breakpoint1 NA . breakpoint1 breakpoint2 BND
#> breakpoint2 NA . breakpoint2 breakpoint1 BND
#> breakend NA . breakend <NA> BND
#> svLen insSeq insLen event HOMLEN
#> <numeric> <character> <numeric> <character> <numeric>
#> sequence_bp1 -7 0 <NA> 0
#> sequence_bp2 -7 0 <NA> 0
#> symbolic_bp1 -7 <NA> 0 <NA> 0
#> symbolic_bp2 -7 <NA> 0 <NA> 0
#> breakpoint1 -7 0 <NA> 0
#> breakpoint2 -7 0 <NA> 0
#> breakend NA 0 <NA> 0
#> -------
#> seqinfo: 1 sequence from an unspecified genome
These rows are linked backed to their originating VCF row through the sourceId
field. All 3 breakpoint notations are identical when represented using the StructuralVariantAnnotation GRanges breakpoint notation.
StructuralVariantAnnotation is built on top of the Bioconductor package VariantAnnotation
. VCF files are loaded using VariantAnnotation::readVcf
, and converted to breakpoint GRanges notation using breakpointRanges
Any non-structural variants such as single nucleotide variants in a VCF will be silently ignored by StructuralVariantAnnotation. More information about VCF
objects can be found by consulting the vignettes in the VariantAnnotation package with browseVignettes("VariantAnnotation")
.
StructuralVariantAnnotation has additional parsing logic to enable reading VCF from popular callers that do not conform to VCF specifications. Specically, StructuralVariantAnnotation supports the following:
SVTYPE=RPL
. Used by Pindel to represent deletion-with-insertion events.INV3
, INV5
. Used Manta and DELLY to indicate that an <INV>
inversion call only includes one of the breakpoints required for an actual inversion.SVTYPE=TRA
/CHR2
/CT
Used by DELLY (and others) to indicate an inter-chromosomal breakpoint.SVTYPE=CTX
. BreakDancer-style notation used by TIGRA to indicate an inter-chromosomal breakpoint.A structural variant call can be ambiguous for one of two reasons. Firstly, a discordant read pair or read depth based caller can make IMPRECISE
call as the caller itself is unsure of the breakpoint position. Second, homology at the breakpoint can result (See Section 5.4.8 of the VCF version 4.3 specifications) in multiple possible breakpoint positions resulting in identical sequence. This ambiguity is intrinsic to the call itself.
StructuralVariantAnnotation incorporates both forms of ambiguity using both CIPOS
and HOMPOS
VCF fields into the GRanges breakend intervals. Breakend with ambiguity will have their start
and end
positions.
These breakpoint positional ambiguity bounds can be removed by specifying nominalPosition=FALSE
when calling breakpointRanges
.
library(StructuralVariantAnnotation)
vcf <- VariantAnnotation::readVcf(system.file("extdata", "gridss.vcf", package = "StructuralVariantAnnotation"))
gr <- breakpointRanges(vcf)
partner()
returns the breakpoint GRanges
object with the order rearranged such that the partner breakend on the other side of each breakpoint corresponds with the local breakend.
partner(gr)
#> GRanges object with 4 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF
#> <Rle> <IRanges> <Rle> | <factor> <character>
#> gridss2h chr12 84963533 - | NA G
#> gridss39h chr12 4886681 + | NA T
#> gridss39o chr12 84350 - | NA G
#> gridss2o chr1 18992158 + | NA C
#> ALT QUAL FILTER sourceId
#> <character> <numeric> <character> <character>
#> gridss2h ]chr1:18992158]AG 55.00 LOW_QUAL;SINGLE_ASSE.. gridss2h
#> gridss39h T[chr12:84350[ 627.96 . gridss39h
#> gridss39o ]chr12:4886681]G 627.96 . gridss39o
#> gridss2o CA[chr12:84963533[ 55.00 LOW_QUAL;SINGLE_ASSE.. gridss2o
#> partner svtype svLen insSeq insLen event
#> <character> <character> <numeric> <character> <integer> <character>
#> gridss2h gridss2o BND NA A 1 gridss2
#> gridss39h gridss39o BND 4802330 0 gridss39
#> gridss39o gridss39h BND 4802330 0 gridss39
#> gridss2o gridss2h BND NA A 1 gridss2
#> HOMLEN
#> <numeric>
#> gridss2h 0
#> gridss39h 0
#> gridss39o 0
#> gridss2o 0
#> -------
#> seqinfo: 84 sequences from an unspecified genome
Single breakends are loaded using the breakendRanges()
function. The GRanges
object is of the same form as breakpointRanges()
but as the breakend partner is not specified, the partner is NA. A single GRanges object can contain both breakend and breakpoint variants.
colo829_vcf <- VariantAnnotation::readVcf(system.file("extdata", "COLO829T.purple.sv.ann.vcf.gz", package = "StructuralVariantAnnotation"))
colo829_bpgr <- breakpointRanges(colo829_vcf)
colo829_begr <- breakendRanges(colo829_vcf)
colo829_gr <- sort(c(colo829_begr, colo829_bpgr))
colo829_gr[seqnames(colo829_gr) == "6"]
#> GRanges object with 10 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID
#> <Rle> <IRanges> <Rle> | <factor>
#> gridss36_2106o 6 26194117 + | NA
#> gridss36_2106h 6 26194406 + | NA
#> gridss37_233635b 6 65298376 + | NA
#> gridss38_18403o 6 94917253-94917268 + | NA
#> gridss40_299o 6 138774180-138774182 + | NA
#> gridss41_7816o 6 168570432-168570448 + | NA
#> gridss17_45233h 6 26194039-26194041 - | NA
#> gridss38_18403h 6 94917320-94917335 - | NA
#> gridss40_35285o 6 138774059 - | NA
#> gridss41_7816h 6 168570465-168570481 - | NA
#> REF ALT QUAL FILTER
#> <character> <character> <numeric> <character>
#> gridss36_2106o G G]6:26194406] 2500.79 PASS
#> gridss36_2106h A A]6:26194117] 2500.79 PASS
#> gridss37_233635b G GTGTTTTTTTCTCTGTGTTG.. 2871.65 PASS
#> gridss38_18403o T T[6:94917327[ 411.40 PON
#> gridss40_299o T T]15:23712618] 3009.75 PASS
#> gridss41_7816o G G[6:168570473[ 4495.46 PON
#> gridss17_45233h A [3:26431918[A 3842.54 PASS
#> gridss38_18403h T ]6:94917260]T 411.40 PON
#> gridss40_35285o T [15:23717166[GTATATT.. 2213.96 PASS
#> gridss41_7816h T ]6:168570440]T 4495.46 PON
#> sourceId svtype svLen
#> <character> <character> <numeric>
#> gridss36_2106o gridss36_2106o BND 288
#> gridss36_2106h gridss36_2106h BND 288
#> gridss37_233635b gridss37_233635b BND NA
#> gridss38_18403o gridss38_18403o BND -66
#> gridss40_299o gridss40_299o BND NA
#> gridss41_7816o gridss41_7816o BND -32
#> gridss17_45233h gridss17_45233h BND NA
#> gridss38_18403h gridss38_18403h BND -66
#> gridss40_35285o gridss40_35285o BND NA
#> gridss41_7816h gridss41_7816h BND -32
#> insSeq insLen event HOMLEN
#> <character> <integer> <character> <numeric>
#> gridss36_2106o 0 gridss36_2106 0
#> gridss36_2106h 0 gridss36_2106 0
#> gridss37_233635b TGTTTTTTTCTCTGTGTTGT.. 683 gridss37_233635 0
#> gridss38_18403o 0 gridss38_18403 15
#> gridss40_299o 0 gridss40_299 2
#> gridss41_7816o 0 gridss41_7816 16
#> gridss17_45233h 0 gridss17_45233 2
#> gridss38_18403h 0 gridss38_18403 15
#> gridss40_35285o GTATATTATC 10 gridss40_35285 0
#> gridss41_7816h 0 gridss41_7816 16
#> partner
#> <character>
#> gridss36_2106o gridss36_2106h
#> gridss36_2106h gridss36_2106o
#> gridss37_233635b <NA>
#> gridss38_18403o gridss38_18403h
#> gridss40_299o gridss40_299h
#> gridss41_7816o gridss41_7816h
#> gridss17_45233h gridss17_45233o
#> gridss38_18403h gridss38_18403o
#> gridss40_35285o gridss40_35285h
#> gridss41_7816h gridss41_7816o
#> -------
#> seqinfo: 25 sequences from an unspecified genome
Functions such as findBreakpointOverlaps()
require the GRanges
object to be consistent. Subsetting a breakpoint GRanges
object can result in one side of a breakpoint getting filtered with the remaining orphaned record no longer valid as its partner no longer exists. Such records need to be be filtered
colo828_chr6_breakpoints <- colo829_gr[seqnames(colo829_gr) == "6"]
# A call to findBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints)
# will fail as there are a) single breakends, and b) breakpoints with missing partners
colo828_chr6_breakpoints <- colo828_chr6_breakpoints[hasPartner(colo828_chr6_breakpoints)]
# As expected, each call on chr6 only overlaps with itself
countBreakpointOverlaps(colo828_chr6_breakpoints, colo828_chr6_breakpoints)
#> [1] 1 1 1 1 1 1
Note that if you did want to include inter-chromosomal breakpoints involving chromosome 6, you would need to update the filtering criteria to include records with chr6 on either side. In such cases, the filtering logic can be simplified by the selfPartnerSingleBreakends
parameter of partner()
. When selfPartnerSingleBreakends=TRUE
, the partner of single breakend events is considered to be the single breakend itself.
colo828_chr6_breakpoints <- colo829_gr[
seqnames(colo829_gr) == "6" |
seqnames(partner(colo829_gr, selfPartnerSingleBreakends=TRUE)) == "6"]
# this way we keep the chr3<->chr6 breakpoint and don't create any orphans
head(colo828_chr6_breakpoints, 1)
#> GRanges object with 1 range and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF
#> <Rle> <IRanges> <Rle> | <factor> <character>
#> gridss17_45233o 3 26431917-26431919 - | NA T
#> ALT QUAL FILTER sourceId
#> <character> <numeric> <character> <character>
#> gridss17_45233o [6:26194040[T 3882.79 PASS gridss17_45233o
#> svtype svLen insSeq insLen event
#> <character> <numeric> <character> <integer> <character>
#> gridss17_45233o BND NA 0 gridss17_45233
#> HOMLEN partner
#> <numeric> <character>
#> gridss17_45233o 2 gridss17_45233h
#> -------
#> seqinfo: 25 sequences from an unspecified genome
findBreakpointOverlaps()
and countBreakpointOverlaps()
are functions for finding and counting overlaps between breakpoint objects. All breakends must have their partner breakend included in the GRanges. A valid overlap requires that breakends on boths sides overlap.
To demonstrate the countBreakpointOverlaps()
function, we use a small subset of data from our structural variant caller benchmarking paper to construct precision recall curves for a pair of callers.
truth_vcf <- readVcf(system.file("extdata", "na12878_chr22_Sudmunt2015.vcf", package = "StructuralVariantAnnotation"))
truth_svgr <- breakpointRanges(truth_vcf)
truth_svgr <- truth_svgr[seqnames(truth_svgr) == "chr22"]
crest_vcf <- readVcf(system.file("extdata", "na12878_chr22_crest.vcf", package = "StructuralVariantAnnotation"))
# Some SV callers don't report QUAL so we need to use a proxy
VariantAnnotation::fixed(crest_vcf)$QUAL <- info(crest_vcf)$left_softclipped_read_count + info(crest_vcf)$left_softclipped_read_count
crest_svgr <- breakpointRanges(crest_vcf)
crest_svgr$caller <- "crest"
hydra_vcf <- readVcf(system.file("extdata", "na12878_chr22_hydra.vcf", package = "StructuralVariantAnnotation"))
hydra_svgr <- breakpointRanges(hydra_vcf)
hydra_svgr$caller <- "hydra"
svgr <- c(crest_svgr, hydra_svgr)
svgr$truth_matches <- countBreakpointOverlaps(svgr, truth_svgr,
# read pair based callers make imprecise calls.
# A margin around the call position is required when matching with the truth set
maxgap=100,
# Since we added a maxgap, we also need to restrict the mismatch between the
# size of the events. We don't want to match a 100bp deletion with a
# 5bp duplication. This will happen if we have a 100bp margin but don't also
# require an approximate size match as well
sizemargin=0.25,
# We also don't want to match a 20bp deletion with a 20bp deletion 80bp away
# by restricting the margin based on the size of the event, we can make sure
# that simple events actually do overlap
restrictMarginToSizeMultiple=0.5,
# HYDRA makes duplicate calls and will sometimes report a variant multiple
# times with slightly different bounds. countOnlyBest prevents these being
# double-counted as multiple true positives.
countOnlyBest=TRUE)
Once we know which calls match the truth set, we can generate Precision-Recall and ROC curves for each caller using one of the many ROC R packages, or directly with dplyr.
suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(require(ggplot2))
ggplot(as.data.frame(svgr) %>%
dplyr::select(QUAL, caller, truth_matches) %>%
dplyr::group_by(caller, QUAL) %>%
dplyr::summarise(
calls=dplyr::n(),
tp=sum(truth_matches > 0)) %>%
dplyr::group_by(caller) %>%
dplyr::arrange(dplyr::desc(QUAL)) %>%
dplyr::mutate(
cum_tp=cumsum(tp),
cum_n=cumsum(calls),
cum_fp=cum_n - cum_tp,
Precision=cum_tp / cum_n,
Recall=cum_tp/length(truth_svgr))) +
aes(x=Recall, y=Precision, colour=caller) +
geom_point() +
geom_line() +
labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")
#> `summarise()` has grouped output by 'caller'. You can override using the
#> `.groups` argument.
Converting to a breakpoint-centric notation does not fully resolve the problem of identical variants reported in different VCF notations. There are two additional commonly encountered notations that required additional handling, especially when comparing short and long read caller variant calls.
Tandem duplications under 1,000bp are usually reported as insertion events by long read callers, but as duplication events by short read caller. These notations are equivalent but typical variant comparisons such as findBreakpointOverlaps()
will not match these variants. The findInsDupOverlaps()
function can be used to identify duplications and insertions that are equivalent.
Transitive breakpoints are breakpoints that can be explained by multiple breakpoints. Identifying transitive breakpoints is important not only for matching short and long read call sets but also for the correct downstream interpretation of genomic rearrangements. Call sets that contain both transitive breakpoints and their constituent breakpoints are internally inconsistent which in turn causes incorrect downstream karyotype reconstructions.
There are two type of transitive breakpoints:
For example If DNA segments A, B, C are rearranged in an A-B-C configuration and the B segment is short, a short read caller may report an A-C breakpoints. This imprecise A-C breakpoint supported only by read pairs, can be explained by the precise A-B and B-C breakpoints.
These are similar in form to imprecise breakpoints in that a complex A-B-C rearrangement is reported as an A-C breakpoint. The different between imprecise and precise transitive breakpoints is that for precise breakpoints, the B segments is not missing, but is reported as inserted sequence between the breakpoints. This is a relatively common occurrence in long read call sets as long read aligners are unable to place short DNA segments due to the high indel error rate of long read calling.
For example, NanoSV calling of Nanopore sequencing of the COLO829 cell line, results in multiple transitive SV calls:
colo829_truth_bpgr <- breakpointRanges(readVcf(system.file("extdata", "truthset_somaticSVs_COLO829.vcf", package = "StructuralVariantAnnotation")))
colo829_nanosv_bpgr <- breakpointRanges(readVcf(system.file("extdata", "colo829_nanoSV_truth_overlap.vcf", package = "StructuralVariantAnnotation")), inferMissingBreakends=TRUE, ignoreUnknownSymbolicAlleles=TRUE)
findTransitiveCalls(colo829_nanosv_bpgr, colo829_truth_bpgr)
#> DataFrame with 4 rows and 4 columns
#> transitive_breakpoint_name total_distance traversed_breakpoint_names
#> <character> <integer> <CharacterList>
#> 1 2294973 200 truthset_52_1,truthset_53_2
#> 2 5621450 184 truthset_43_1,truthset_7_2
#> 3 svrecord102_bp2 200 truthset_53_1,truthset_52_2
#> 4 svrecord149_bp2 184 truthset_7_1,truthset_43_2
#> distance_to_traversed_breakpoint
#> <IntegerList>
#> 1 0,200
#> 2 0,184
#> 3 0,200
#> 4 0,184
Note that since StructuralVariantAnnotation
treats both breakends equally, each transitive path will have two results. One for the path traversing from one breakend, another for the traversal in the opposite direction.
The package supports converting GRanges objects to BEDPE files. The BEDPE format is defined by bedtools
. This is achieved using breakpointgr2pairs
and pairs2breakpointgr
functions to convert to and from the GRanges Pairs
notation used by rtracklayer
.
suppressPackageStartupMessages(require(rtracklayer))
# Export to BEDPE
rtracklayer::export(breakpointgr2pairs(gr), con="example.bedpe")
# Import to BEDPE
bedpe.gr <- pairs2breakpointgr(rtracklayer::import("example.bedpe"))
One way of visualising paired breakpoints is by circos plots.
Here, we use the plotting package ggbio
which provides flexible track functions which bind with ggplot2
objects. It takes GRanges
objects as input and supports circos plots. To plot structural variant breakpoints in a circos plot using ggbio
, we need to first prepare the breakpoint GRanges. The function requires a special column, indicating the end of the link using GRanges format.
suppressPackageStartupMessages(require(ggbio))
gr.circos <- colo829_bpgr[seqnames(colo829_bpgr) %in% seqlevels(biovizBase::hg19sub)]
seqlevels(gr.circos) <- seqlevels(biovizBase::hg19sub)
mcols(gr.circos)$to.gr <- granges(partner(gr.circos))
We can then plot the breakpoints against reference genomes.
p <- ggbio() +
circle(gr.circos, geom="link", linked.to="to.gr") +
circle(biovizBase::hg19sub, geom='ideo', fill='gray70') +
circle(biovizBase::hg19sub, geom='scale', size=2) +
circle(biovizBase::hg19sub, geom='text', aes(label=seqnames), vjust=0, size=3)
#> Warning in recycleSingleBracketReplacementValue(value, x, nsbs): number of
#> values supplied is not a sub-multiple of the number of values to be replaced
p