The use of personalised or population-level variants opens the door to the possibility of creating a variant-modified reference genome, or transcriptome. The package transmogR allows the creation of both, with a focus on combining both in order to created a custom reference transcriptome, along with decoy transcripts for use with the pseudo-aligner salmon.
transmogR 1.3.0
The incorporation of personalised or population-level variants into a
reference genome has been shown to have a significant impact on subsequent
alignments (Kaminow et al. 2022).
Whilst implemented for splice-aware alignment or RNA-Seq data using
STARconsensus, the package transmogR
enables the creation of a
variant-modified reference transcriptome for use with pseudo aligners such as
salmon (Srivastava et al. 2020).
In addition, multiple visualisation and summarisation methods are included for a
cursory analysis of any custom variant sets being used.
Whilst the subsequent code is demonstrated on a small genomic region, the complete process for creating a modified a reference can run in under 20 minutes if using 4 or more cores.
Importantly, it is expected that the user will have carefully prepared their set of variants such that only the following exact variant types are included.
Any variants which do not conform to these criteria will likely cause a series of frustrating errors when attempting to use this package.
In order to perform the operations in this vignette, the following packages require installation.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("transmogR")
Once these packages are installed, we can load them easily
library(VariantAnnotation)
library(rtracklayer)
library(extraChIPs)
library(transmogR)
library(BSgenome.Hsapiens.UCSC.hg38)
In order to create a modified reference, three primary data objects are required: 1) a reference genome; 2) a set of genomic variants; and 3) a set of exon-level co-ordinates defining transcript structure.
For this vignette, we’ll use GRCh38 as our primary reference genome, but
restricting the sequences to chr1 only.
The package can take either a DNAStringSet
or BSgenome
object as the
reference genome.
chr1 <- getSeq(BSgenome.Hsapiens.UCSC.hg38, "chr1")
chr1 <- as(chr1, "DNAStringSet")
names(chr1) <- "chr1"
chr1
## DNAStringSet object of length 1:
## width seq names
## [1] 248956422 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
A small set of variants has been provided with the package.
sq <- seqinfo(chr1)
genome(sq) <- "GRCh38"
vcf <- system.file("extdata/1000GP_subset.vcf.gz", package = "transmogR")
vcf_param <- ScanVcfParam(fixed = "ALT", info = NA, which = GRanges(sq))
var <- rowRanges(readVcf(vcf, param = vcf_param))
seqinfo(var) <- sq
var
## GRanges object with 100 ranges and 3 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 1:113969:C:T chr1 113969 * | chr1
## 1:121009:C:T chr1 121009 * | chr1
## 1:123511:G:A chr1 123511 * | chr1
## 1:126113:C:A chr1 126113 * | chr1
## 1:128798:C:T chr1 128798 * | chr1
## ... ... ... ... . ...
## 1:839405:G:A chr1 839405 * | chr1
## 1:839480:CACACACCTG:C chr1 839480-839494 * | chr1
## 1:839515:CTAGACACAC:C chr1 839515-839543 * | chr1
## 1:840046:G:A chr1 840046 * | chr1
## 1:840279:A:G chr1 840279 * | chr1
## REF ALT
## <DNAStringSet> <DNAStringSetList>
## 1:113969:C:T C T
## 1:121009:C:T C T
## 1:123511:G:A G A
## 1:126113:C:A C A
## 1:128798:C:T C T
## ... ... ...
## 1:839405:G:A G A
## 1:839480:CACACACCTG:C CACACACCTGGACAA C
## 1:839515:CTAGACACAC:C CTAGACACAC...CACACACACG C
## 1:840046:G:A G A
## 1:840279:A:G A G
## -------
## seqinfo: 1 sequence from GRCh38 genome
An additional set of transcripts derived from Gencode v441 https://www.gencodegenes.org/human/ has also been provided.
f <- system.file("extdata/gencode.v44.subset.gtf.gz", package = "transmogR")
gtf <- import.gff(f, which = GRanges(sq))
seqinfo(gtf) <- sq
gtf
## GRanges object with 603 ranges and 11 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] chr1 89295-133723 - | rtracklayer gene NA
## [2] chr1 89295-120932 - | rtracklayer transcript NA
## [3] chr1 120775-120932 - | rtracklayer exon NA
## [4] chr1 112700-112804 - | rtracklayer exon NA
## [5] chr1 92091-92240 - | rtracklayer exon NA
## ... ... ... ... . ... ... ...
## [599] chr1 852671-852766 + | rtracklayer exon NA
## [600] chr1 853391-854096 + | rtracklayer exon NA
## [601] chr1 851348-852752 + | rtracklayer transcript NA
## [602] chr1 851348-852110 + | rtracklayer exon NA
## [603] chr1 852671-852752 + | rtracklayer exon NA
## phase gene_id gene_type gene_name
## <integer> <character> <character> <character>
## [1] <NA> ENSG00000238009.6 lncRNA ENSG00000238009
## [2] <NA> ENSG00000238009.6 lncRNA ENSG00000238009
## [3] <NA> ENSG00000238009.6 lncRNA ENSG00000238009
## [4] <NA> ENSG00000238009.6 lncRNA ENSG00000238009
## [5] <NA> ENSG00000238009.6 lncRNA ENSG00000238009
## ... ... ... ... ...
## [599] <NA> ENSG00000228794.12 lncRNA LINC01128
## [600] <NA> ENSG00000228794.12 lncRNA LINC01128
## [601] <NA> ENSG00000228794.12 lncRNA LINC01128
## [602] <NA> ENSG00000228794.12 lncRNA LINC01128
## [603] <NA> ENSG00000228794.12 lncRNA LINC01128
## transcript_id transcript_type transcript_name exon_number
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] ENST00000466430.5 lncRNA ENST00000466430 <NA>
## [3] ENST00000466430.5 lncRNA ENST00000466430 1
## [4] ENST00000466430.5 lncRNA ENST00000466430 2
## [5] ENST00000466430.5 lncRNA ENST00000466430 3
## ... ... ... ... ...
## [599] ENST00000688420.1 lncRNA LINC01128-235 4
## [600] ENST00000688420.1 lncRNA LINC01128-235 5
## [601] ENST00000425657.1 retained_intron LINC01128-203 <NA>
## [602] ENST00000425657.1 retained_intron LINC01128-203 1
## [603] ENST00000425657.1 retained_intron LINC01128-203 2
## -------
## seqinfo: 1 sequence from GRCh38 genome
Splitting this gtf into feature types can also be very handy for downstream processes.
gtf <- splitAsList(gtf, gtf$type)
Knowing where our variants lie, and how they relate to each other can be informative, and as such, some simple visualisation and summarisation functions have been included. In the following, we can check to see how many exons directly overlap a variant, showing how many unique genes this summarises to. Any ids, which don’t overlap a variants are also described in the plot title.
upsetVarByCol(gtf$exon, var, mcol = "gene_id")
In addition, we can obtain a simple breakdown of overlapping regions using a
GRangesList.
We can use the function defineRegions()
from extraChIPs
to define regions
based on gene & transcript locations.
This function assigns each position in the genome uniquely to a given feature
hierarchically, using all provided transcripts, meaning some exons will be
considered as promoters.
To ensure that all exons are included as exons, we can just substitute in the
values from our gtf for this feature type.
regions <- defineRegions(gtf$gene, gtf$transcript, gtf$exon, proximal = 0)
regions$exon <- granges(gtf$exon)
overlapsByVar(regions, var)
## Deletion Insertion SNV Total
## promoter 1 1 46 48
## upstream_promoter 0 0 20 20
## exon 0 1 37 38
## intron 5 0 24 29
## intergenic 0 0 0 0
The simplest method for modifying a reference genome is to simply call
genomogrify()
with either a DNAStringSet
or BSgenome
object.
A tag can be optionally added to all sequence names to avoid confusion.
chr1_mod <- genomogrify(chr1, var, tag = "mod")
chr1_mod
## DNAStringSet object of length 1:
## width seq names
## [1] 248956362 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1_mod
The new reference genome can be exported to fasta format using
writeXStringSet()
.
The process for creating a variant-modified reference transcriptome is
essentially the same as above, with the addition of exon-level co-ordinates for
the set of transcripts.
An optional tag can be added to all transcripts to indicate which have been
modified, with an additional tag able to be included which indicates which type
of variant has been incorporated into the new transcript sequence.
Variant tags will be one of s
, i
or d
to indicate SNV, Insertions or
Deletions respectively.
In our example dataset, only one transcript contains both SNVs and an insertion.
trans_mod <- transmogrify(
chr1, var, gtf$exon, tag = "mod", var_tags = TRUE
)
trans_mod
## DNAStringSet object of length 112:
## width seq names
## [1] 1947 ACCCTCCTTGAGACAGCCCTCC...TAAACAATACACACGTGTTAAA ENST00000326734.2...
## [2] 1702 CACACCGTGAGCTGCTGAGACG...GTGCAGGGCACAGGTGGGCGCC ENST00000357876.6
## [3] 1358 AATCAGAACTCGCGGTGGGGGC...ATAAAATTAATGAGAATGATCT ENST00000412115.2...
## [4] 421 GACAGGGTCTCCCTCTGTTGTC...AAAGCATCCAGGATTCAATGAA ENST00000414688.6
## [5] 894 CTGCAATACCTCACTCAATCTC...GTATTTGATGGCCTCTGTTGTT ENST00000415295.1
## ... ... ...
## [108] 3097 GCCAGTGTCTCCGCCGGTTGAA...GATAAAACTTAAAAATATCCCC ENST00000701768.1
## [109] 1601 GGAGTGCGTTCGGTGCGCCGTG...AGCTATTAAAAGAGACAGAGGC ENST00000702098.1
## [110] 1432 GTCTGCGTCGGGTTCTGTTGGA...ATCAATAAAAATTTAAATGCTC ENST00000702273.1
## [111] 1460 CTGTTAGGTAAGAGGCGCGTGA...ATCAATAAAAATTTAAATGCTC ENST00000702557.1
## [112] 1651 GTTAGGTAAGAGGCGCGTGAGG...GCTATTAAAAGAGACAGAGGCA ENST00000702847.1
names(trans_mod)[grepl("_si", names(trans_mod))]
## [1] "ENST00000473798.2_mod_si"
This can be simply exported again using writeXStringSet()
.
If using decoy transcripts for salmon
, the gentrome
can also be simply
exported by combining the two modified references.
Both of the above processes rely on the lower-level functions owl()
and
indelcator()
which overwrite letters or substitute indels respectively.
These are also able to be called individually as shown in the help pages.
Beyond these lower-level functions, PAR-Y regions for hg38, hg19 and CHM13 are
able to obtained using parY()
and passing the appropriate seqinfo
object.
This seqinfo
object checks the length of the Y-chromosome and guesses which
reference genome has been used.
sq_hg38 <- seqinfo(BSgenome.Hsapiens.UCSC.hg38)
parY(sq_hg38)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | PAR
## <Rle> <IRanges> <Rle> | <character>
## [1] chrY 10001-2781479 * | PAR1
## [2] chrY 56887903-57217415 * | PAR2
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
If the user wishes to exclude transcripts in the PAR-Y region, these ranges can
be passed to transmogrify()
and any transcripts which overlap the PAR-Y
region will be excluded.
Alternatively, passing the entire Y-chromosome to transmogrify()
will
exclude all transcripts in the Y-chromosome, as may be preferred for
female-specific references.
These regions can also be passed to genomogrify()
as a mask, with all bases
within the masked region being re-assigned as an N.
In addition, the set of splice junctions associated with any transcript can be obtained using our known exons.
ec <- c("transcript_id", "transcript_name", "gene_id", "gene_name")
sj <- sjFromExons(gtf$exon, extra_cols = ec)
sj
## GRanges object with 730 ranges and 6 metadata columns:
## seqnames ranges strand | site transcript_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 182745-182752 + | donor ENST00000624431.2
## [2] chr1 183128-183132 + | acceptor ENST00000624431.2
## [3] chr1 183215-183222 + | donor ENST00000624431.2
## [4] chr1 183490-183494 + | acceptor ENST00000624431.2
## [5] chr1 183570-183577 + | donor ENST00000624431.2
## ... ... ... ... . ... ...
## [726] chr1 810061-810068 - | donor ENST00000590817.3
## [727] chr1 810170-810174 - | acceptor ENST00000447500.4
## [728] chr1 817367-817374 - | donor ENST00000447500.4
## [729] chr1 827664-827671 - | donor ENST00000635509.2
## [730] chr1 827664-827671 - | donor ENST00000634337.2
## transcript_name gene_id gene_name exon_number
## <character> <character> <character> <integer>
## [1] DDX11L17-201 ENSG00000279928.2 DDX11L17 1
## [2] DDX11L17-201 ENSG00000279928.2 DDX11L17 2
## [3] DDX11L17-201 ENSG00000279928.2 DDX11L17 2
## [4] DDX11L17-201 ENSG00000279928.2 DDX11L17 3
## [5] DDX11L17-201 ENSG00000279928.2 DDX11L17 3
## ... ... ... ... ...
## [726] ENST00000590817 ENSG00000230092.8 ENSG00000230092 1
## [727] ENST00000447500 ENSG00000290784.1 ENSG00000290784 2
## [728] ENST00000447500 ENSG00000290784.1 ENSG00000290784 1
## [729] ENST00000635509 ENSG00000230021.10 ENSG00000230021 1
## [730] ENST00000634337 ENSG00000230021.10 ENSG00000230021 1
## -------
## seqinfo: 1 sequence from GRCh38 genome
Many splice junctions will be shared across multiple transcripts, so the
returned set of junctions can also be simplified using chopMC()
from
extraChIPs
.
chopMC(sj)
## GRanges object with 196 ranges and 6 metadata columns:
## seqnames ranges strand | site
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 182745-182752 + | donor
## [2] chr1 183128-183132 + | acceptor
## [3] chr1 183215-183222 + | donor
## [4] chr1 183490-183494 + | acceptor
## [5] chr1 183570-183577 + | donor
## ... ... ... ... . ...
## [192] chr1 808623-808627 - | acceptor
## [193] chr1 810061-810068 - | donor
## [194] chr1 810170-810174 - | acceptor
## [195] chr1 817367-817374 - | donor
## [196] chr1 827664-827671 - | donor
## transcript_id transcript_name
## <CharacterList> <CharacterList>
## [1] ENST00000624431.2 DDX11L17-201
## [2] ENST00000624431.2 DDX11L17-201
## [3] ENST00000624431.2 DDX11L17-201
## [4] ENST00000624431.2 DDX11L17-201
## [5] ENST00000624431.2 DDX11L17-201
## ... ... ...
## [192] ENST00000447500.4,ENST00000590817.3 ENST00000447500,ENST00000590817
## [193] ENST00000447500.4,ENST00000590817.3 ENST00000447500,ENST00000590817
## [194] ENST00000447500.4 ENST00000447500
## [195] ENST00000447500.4 ENST00000447500
## [196] ENST00000635509.2,ENST00000634337.2 ENST00000635509,ENST00000634337
## gene_id gene_name
## <CharacterList> <CharacterList>
## [1] ENSG00000279928.2 DDX11L17
## [2] ENSG00000279928.2 DDX11L17
## [3] ENSG00000279928.2 DDX11L17
## [4] ENSG00000279928.2 DDX11L17
## [5] ENSG00000279928.2 DDX11L17
## ... ... ...
## [192] ENSG00000290784.1,ENSG00000230092.8 ENSG00000290784,ENSG00000230092
## [193] ENSG00000290784.1,ENSG00000230092.8 ENSG00000290784,ENSG00000230092
## [194] ENSG00000290784.1 ENSG00000290784
## [195] ENSG00000290784.1 ENSG00000290784
## [196] ENSG00000230021.10,ENSG00000230021.10 ENSG00000230021,ENSG00000230021
## exon_number
## <IntegerList>
## [1] 1
## [2] 2
## [3] 2
## [4] 3
## [5] 3
## ... ...
## [192] 3,2
## [193] 2,1
## [194] 2
## [195] 1
## [196] 1,1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
As a further alternative, splice junctions can be returned as a set of
interactions, with donor sites being assigned to the anchorOne
element, and
acceptor sites being placed in the anchorTwo
element
This enables the identification of all splice junctions for specific
transcripts.
sj <- sjFromExons(gtf$exon, extra_cols = ec, as = "GInteractions")
subset(sj, transcript_name == "DDX11L17-201")
## GInteractions object with 4 interactions and 5 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2 | transcript_id
## <Rle> <IRanges> <Rle> <IRanges> | <character>
## [1] chr1 182745-182752 --- chr1 183128-183132 | ENST00000624431.2
## [2] chr1 183215-183222 --- chr1 183490-183494 | ENST00000624431.2
## [3] chr1 183570-183577 --- chr1 183736-183740 | ENST00000624431.2
## [4] chr1 183900-183907 --- chr1 183977-183981 | ENST00000624431.2
## transcript_name gene_id gene_name sj
## <character> <character> <character> <integer>
## [1] DDX11L17-201 ENSG00000279928.2 DDX11L17 1
## [2] DDX11L17-201 ENSG00000279928.2 DDX11L17 2
## [3] DDX11L17-201 ENSG00000279928.2 DDX11L17 3
## [4] DDX11L17-201 ENSG00000279928.2 DDX11L17 4
## -------
## regions: 196 ranges and 0 metadata columns
## seqinfo: 1 sequence from GRCh38 genome
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.75.0
## [3] BiocIO_1.17.0 transmogR_1.3.0
## [5] extraChIPs_1.11.0 tibble_3.2.1
## [7] ggside_0.3.1 ggplot2_3.5.1
## [9] BiocParallel_1.41.0 rtracklayer_1.67.0
## [11] VariantAnnotation_1.53.0 Rsamtools_2.23.0
## [13] Biostrings_2.75.0 XVector_0.47.0
## [15] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [17] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [19] IRanges_2.41.0 S4Vectors_0.45.0
## [21] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [23] BiocGenerics_0.53.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 filelock_1.0.3
## [3] polyclip_1.10-7 XML_3.99-0.17
## [5] rpart_4.1.23 lifecycle_1.0.4
## [7] httr2_1.0.5 edgeR_4.5.0
## [9] lattice_0.22-6 ensembldb_2.31.0
## [11] MASS_7.3-61 backports_1.5.0
## [13] magrittr_2.0.3 limma_3.63.0
## [15] Hmisc_5.2-0 sass_0.4.9
## [17] rmarkdown_2.28 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.15.0
## [21] Gviz_1.51.0 DBI_1.2.3
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] zlibbioc_1.53.0 purrr_1.0.2
## [27] AnnotationFilter_1.31.0 biovizBase_1.55.0
## [29] RCurl_1.98-1.16 nnet_7.3-19
## [31] tweenr_2.0.3 rappdirs_0.3.3
## [33] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
## [35] codetools_0.2-20 DelayedArray_0.33.0
## [37] xml2_1.3.6 ggforce_0.4.2
## [39] tidyselect_1.2.1 futile.logger_1.4.3
## [41] UCSC.utils_1.3.0 farver_2.1.2
## [43] ComplexUpset_1.3.3 BiocFileCache_2.15.0
## [45] base64enc_0.1-3 GenomicAlignments_1.43.0
## [47] jsonlite_1.8.9 Formula_1.2-5
## [49] tools_4.5.0 progress_1.2.3
## [51] Rcpp_1.0.13 glue_1.8.0
## [53] gridExtra_2.3 SparseArray_1.7.0
## [55] xfun_0.48 dplyr_1.1.4
## [57] withr_3.0.2 formatR_1.14
## [59] BiocManager_1.30.25 fastmap_1.2.0
## [61] latticeExtra_0.6-30 fansi_1.0.6
## [63] digest_0.6.37 R6_2.5.1
## [65] colorspace_2.1-1 jpeg_0.1-10
## [67] dichromat_2.0-0.1 biomaRt_2.63.0
## [69] RSQLite_2.3.7 utf8_1.2.4
## [71] tidyr_1.3.1 generics_0.1.3
## [73] data.table_1.16.2 prettyunits_1.2.0
## [75] InteractionSet_1.35.0 httr_1.4.7
## [77] htmlwidgets_1.6.4 S4Arrays_1.7.0
## [79] pkgconfig_2.0.3 gtable_0.3.6
## [81] blob_1.2.4 htmltools_0.5.8.1
## [83] bookdown_0.41 ProtGenerics_1.39.0
## [85] scales_1.3.0 png_0.1-8
## [87] knitr_1.48 lambda.r_1.2.4
## [89] rstudioapi_0.17.1 rjson_0.2.23
## [91] checkmate_2.3.2 curl_5.2.3
## [93] cachem_1.1.0 stringr_1.5.1
## [95] parallel_4.5.0 foreign_0.8-87
## [97] AnnotationDbi_1.69.0 restfulr_0.0.15
## [99] pillar_1.9.0 grid_4.5.0
## [101] vctrs_0.6.5 dbplyr_2.5.0
## [103] cluster_2.1.6 htmlTable_2.4.3
## [105] evaluate_1.0.1 VennDiagram_1.7.3
## [107] GenomicFeatures_1.59.0 cli_3.6.3
## [109] locfit_1.5-9.10 compiler_4.5.0
## [111] futile.options_1.0.1 rlang_1.1.4
## [113] crayon_1.5.3 labeling_0.4.3
## [115] interp_1.1-6 forcats_1.0.0
## [117] stringi_1.8.4 deldir_2.0-4
## [119] munsell_0.5.1 lazyeval_0.2.2
## [121] csaw_1.41.0 Matrix_1.7-1
## [123] hms_1.1.3 patchwork_1.3.0
## [125] bit64_4.5.2 KEGGREST_1.47.0
## [127] statmod_1.5.0 highr_0.11
## [129] igraph_2.1.1 broom_1.0.7
## [131] memoise_2.0.1 bslib_0.8.0
## [133] bit_4.5.0 GenomicInteractions_1.41.0