SynExtend
is a package of tools for working with objects of class Synteny
built from the package DECIPHER
’s FindSynteny()
function.
Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a ‘syntenic block’. That designation of synteny can then used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction.
FindSynteny
takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximate shared k-mers are significant. Combining the information generated by FindSynteny
with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the k-mers that they share, or alignment.
Currently SynExtend
contains one set of functions for performig orthology predictions, as well as a rearrangement estimation function that is currently under construction.
SynExtend
in R by running the following commands:if (!requireNamespace("BiocManager",
quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("SynExtend")
Using the FindSynteny
function in DECIPHER
builds an object of class Synteny
. In this tutorial, a prebuilt DECIPHER
database is used. For database construction see ?Seqs2DB
in DECIPHER
. This example starts with a database containing three endosymbiont genomes that were chosen to keep package data to a minimum.
library(SynExtend)
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
##
## dendrapply
DBPATH <- system.file("extdata",
"Endosymbionts_v03a.sqlite",
package = "SynExtend")
Syn <- FindSynteny(dbFile = DBPATH)
## ================================================================================
##
## Time difference of 24.9 secs
Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class Synteny
can also be plotted to provide clear visual representations of the data inside. The genomes used in this example are distantly related and fairly dissimilar.
Syn
## 1 2 3 4
## 1 1 seq 17% hits 16% hits 8.2% hits
## 2 73 blocks 1 seq 16% hits 5.4% hits
## 3 68 blocks 196 blocks 1 seq 4.4% hits
## 4 38 blocks 61 blocks 46 blocks 1 seq
pairs(Syn)
Data present inside objects of class Synteny
can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see ?FindSynteny
in the package DECIPHER
.
print(head(Syn[[1, 2]]))
## index1 index2 strand width start1 start2 frame1 frame2
## [1,] 1 1 0 1587 70082 398904 2 3
## [2,] 1 1 0 1260 253700 334742 2 2
## [3,] 1 1 0 123 159564 2516455 3 1
## [4,] 1 1 0 1080 159696 2516605 3 1
## [5,] 1 1 0 825 3748 3918707 1 2
## [6,] 1 1 0 471 17701 4103 1 2
print(head(Syn[[2, 1]]))
## index1 index2 strand score start1 start2 end1 end2 first_hit
## [1,] 1 1 0 1025 70082 398904 71668 400490 1
## [2,] 1 1 0 876 253700 334742 254959 336001 2
## [3,] 1 1 0 559 159564 2516455 160775 2517684 3
## [4,] 1 1 0 557 3748 3918707 4572 3919531 5
## [5,] 1 1 0 439 17701 4103 19398 5797 6
## [6,] 1 1 0 349 171438 3882325 172199 3883086 10
## last_hit
## [1,] 1
## [2,] 2
## [3,] 4
## [4,] 5
## [5,] 9
## [6,] 10
The above printed objects show the data for the comparison between the first and second genome in our database.
To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.
Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. Parsed GFFs can be constructed with gffToDataFrame
, for full functionality, or GFFs can be imported via rtracklater::import()
for limited functionality. GeneCalls for both the PairSummaries
and NucleotideOverlap
functions must be named list, and those names must match dimnames(Syn)[[1]]
.
# generating genecalls with local data:
GC <- gffToDataFrame(GFF = system.file("extdata",
"GCF_902860225.1_Azoamicus_ciliaticola_assembly_genomic.gff.gz",
package = "SynExtend"),
Verbose = TRUE)
## ================================================================================
## Time difference of 3.798205 secs
# in an effort to be space conscious, not all original gffs are kept within this package
GeneCalls <- get(data("Endosymbionts_GeneCalls", package = "SynExtend"))
SynExtend
’s gffToDataFrame
function will directly import gff files into a usable format, and includes other extracted information.
print(head(GeneCalls[[1]]))
## DataFrame with 6 rows and 11 columns
## Index Strand Start Stop Type ID
## <integer> <integer> <integer> <integer> <character> <character>
## 1 1 1 96 1103 gene gene-ESZ_RS00005
## 2 1 0 1189 1422 gene gene-ESZ_RS00010
## 3 1 0 1429 2946 gene gene-ESZ_RS00015
## 4 1 0 2956 4587 gene gene-ESZ_RS00020
## 5 1 0 4623 4695 gene gene-ESZ_RS00025
## 6 1 0 4736 5989 gene gene-ESZ_RS00030
## Range Product Coding Translation_Table
## <IRangesList> <character> <logical> <character>
## 1 96-1103 tRNA (adenosine(37)-.. TRUE 11
## 2 1189-1422 30S ribosomal protei.. TRUE 11
## 3 1429-2946 DNA primase TRUE 11
## 4 2956-4587 RNA polymerase sigma.. TRUE 11
## 5 4623-4695 tRNA-Ile FALSE NA
## 6 4736-5989 UbiA family prenyltr.. TRUE 11
## Contig
## <character>
## 1 NZ_LR794158.1
## 2 NZ_LR794158.1
## 3 NZ_LR794158.1
## 4 NZ_LR794158.1
## 5 NZ_LR794158.1
## 6 NZ_LR794158.1
Raw GFF imports are also acceptable, but prevent alignments in amino acid space with PairSummaries()
.
X01 <- rtracklayer::import(system.file("extdata",
"GCF_902860225.1_Azoamicus_ciliaticola_assembly_genomic.gff.gz",
package = "SynExtend"))
class(X01)
print(X01)
SynExtend
’s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links.
Links <- NucleotideOverlap(SyntenyObject = Syn,
GeneCalls = GeneCalls,
Verbose = TRUE)
##
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.4339626 secs
The Links
object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class Synteny
. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should a user desire to have access to it.
class(Links)
## [1] "LinkedPairs"
print(Links)
## 1 2 3 4
## 1 1 Contig 69 Pairs 61 Pairs 31 Pairs
## 2 134 Kmers 1 Contig 181 Pairs 51 Pairs
## 3 173 Kmers 391 Kmers 1 Contig 35 Pairs
## 4 104 Kmers 108 Kmers 103 Kmers 1 Contig
This raw data can be processed to provide a straightforward summary of predicted pairs.
LinkedPairs1 <- PairSummaries(SyntenyLinks = Links,
DBPATH = DBPATH,
PIDs = FALSE,
Verbose = TRUE)
##
## Preparing overhead data.
## Overhead complete.
## Collecting pairs.
## ================================================================================
## Time difference of 3.577887 secs
The object LinkedPairs1
is a data.frame where each row is populated by information about a predicted orthologous pair. By default PairSummaries
uses a simple model to determine whether the k-mers that link a pair of genes are likely to provide an erroneous link. When set to Model = "Global"
, is is simply a prediction of whether the involved nucleotides are likely to describe a pair of genomic features whose alignment would result in a PID that falls within a random distribution. This model is effective if somewhat permissive, but is significantly faster than performing many pairwise alignments.
print(head(LinkedPairs1))
## p1 p2 ExactMatch Consensus TotalKmers MaxKmer p1FeatureLength
## 1 1_1_4 2_1_3729 825 0.9675952 1 825 1632
## 2 1_1_28 2_1_4 1677 0.9969392 4 813 2385
## 3 1_1_34 2_1_4433 504 0.9842255 2 294 969
## 4 1_1_41 2_1_4555 687 0.9946023 1 687 1614
## 5 1_1_43 2_1_4548 1281 0.9967678 3 756 1866
## 6 1_1_45 2_1_4540 333 0.9840446 2 240 783
## p2FeatureLength Adjacent TetDist PIDType PredictedPID
## 1 1872 0 0.11640382 AA 0.9945130
## 2 2424 0 0.10347594 AA 0.9501826
## 3 930 0 0.09708407 AA 0.8802065
## 4 1656 0 0.11448838 AA 0.9923819
## 5 1875 0 0.11172669 AA 0.9781976
## 6 867 1 0.11073001 AA 0.7290626
PairSummaries includes arguments that allow for aligning all pairs that are predicted, via PIDs = TRUE
, while IgnoreDefaultStringSet = FALSE
indicates that alignments should be performed in nucleotide or amino acid space as is appropriate for the linked sequences. Setting IgnoreDefaultStringSet = TRUE
will force all alignments into nucleotide space.
As of SynExtend v 1.3.13, the functions ExtractBy
and DisjointSet
have been added to provide users with direct tools to work with PairSummaries
objects.
SingleLinkageClusters <- DisjointSet(Pairs = LinkedPairs1,
Verbose = TRUE)
##
## Assigning initial root:
## ================================================================================
##
## Time difference of 0.004498482 secs
##
## Assigning final root:
## ================================================================================
##
## Time difference of 0.005007267 secs
##
## Assigning single linkage clusters.
## Assignments complete.
##
## Time difference of 0.01261878 secs
# extract the first 10 clusters
Sets <- ExtractBy(x = LinkedPairs1,
y = DBPATH,
z = SingleLinkageClusters[1:10],
Verbose = TRUE)
##
## Extracting Sequences:
## ================================================================================
##
## Arranging Sequences:
## ================================================================================
##
## Time difference of 0.3249714 secs
head(Sets)
## [[1]]
## DNAStringSet object of length 4:
## width seq names
## [1] 1632 ATGTATGATTTAGATTCAGAATA...AAGAATTTTCGTATGATGATTGA 1_1_4
## [2] 1872 GTGTCGAACACTGAGGATCAGAT...GCTTCCTCGACGGCGACAACTAA 2_1_3729
## [3] 1836 ATGGAACATAACCCACAATCACA...TGCGAAGCTTTTTAGATGATTAA 3_1_599
## [4] 1935 ATGACAAGAGATGATCTAATAAA...TTGAAACCCCTCCAATCAAGTAA 4_1_240
##
## [[2]]
## DNAStringSet object of length 4:
## width seq names
## [1] 2385 ATGATTAATTATGATTCTTCGAA...CAAATTTTGACTTAGATTTTTAA 1_1_28
## [2] 2424 ATGACCGACAACGACAAATCCAG...CCGTCGAAAATCTCGACGTCTAA 2_1_4
## [3] 2415 ATGTTGAGTGCTTATGATTCTTC...AAGCAACAAATGTAGATTACTAA 3_1_640
## [4] 2385 ATGGATTCAGACGATATTAAGAA...ATGTGAGAAATTTAGATGTATAA 4_1_4
##
## [[3]]
## DNAStringSet object of length 2:
## width seq names
## [1] 969 ATGAGAAAAAAAATATCTTTAAT...TCAAATCTAAATTAAATATTTAA 1_1_34
## [2] 930 ATGCAGAAAGTTGCGATCGTGGG...TCGAACGAATTCCGTTAGCCTGA 2_1_4433
##
## [[4]]
## DNAStringSet object of length 3:
## width seq names
## [1] 1614 ATGATTAAAAATATTAAATATAT...TTATTGTTAAAAACATTTTATAA 1_1_41
## [2] 1656 ATGGATAACATTCGACTCTTTCT...TCGAAGAAGGTGCGAAGAGCTGA 2_1_4555
## [3] 1929 ATGGTATCACCATTTGTGTTAAT...GATTACACAATAAAAATAACTAA 3_1_644
##
## [[5]]
## DNAStringSet object of length 3:
## width seq names
## [1] 1866 ATGTTACGTTGTTTTAATTATTT...TTTATATAAAAAAAATTTTTTGA 1_1_43
## [2] 1875 ATGTTTCACAGCGAGGCCTACGA...ACTTGAAAAAGAAGTCCGCCTGA 2_1_4548
## [3] 1899 ATGTTCTATCCCACTCATTTTGA...TGTTAGGATACAATACATGTTAA 3_1_655
##
## [[6]]
## DNAStringSet object of length 3:
## width seq names
## [1] 783 ATGAGTGTAAATAATAGTGTTAT...GTTTAGCAGAAGAAGCACATTAA 1_1_45
## [2] 867 ATGGCTGGCTCTGAGACGATTAC...TGGCGCATTCTGAACATCACTGA 2_1_4540
## [3] 813 ATGTCAGGAATCAAAGGCACTTC...CTACAGCTCATGATCCCTGTTGA 3_1_654
Session Info:
sessionInfo()
## R Under development (unstable) (2025-01-20 r87609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SynExtend_1.19.6 DECIPHER_3.3.2 Biostrings_2.75.3
## [4] GenomeInfoDb_1.43.4 XVector_0.47.2 IRanges_2.41.2
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.6 generics_0.1.3
## [10] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bit_4.5.0.1 jsonlite_1.8.9 compiler_4.5.0
## [4] BiocManager_1.30.25 crayon_1.5.3 Rcpp_1.0.14
## [7] tinytex_0.54 blob_1.2.4 magick_2.8.5
## [10] parallel_4.5.0 jquerylib_0.1.4 yaml_2.3.10
## [13] fastmap_1.2.0 R6_2.5.1 knitr_1.49
## [16] bookdown_0.42 GenomeInfoDbData_1.2.13 DBI_1.2.3
## [19] bslib_0.9.0 rlang_1.1.5 cachem_1.1.0
## [22] xfun_0.50 sass_0.4.9 bit64_4.6.0-1
## [25] RSQLite_2.3.9 memoise_2.0.1 cli_3.6.3
## [28] magrittr_2.0.3 digest_0.6.37 lifecycle_1.0.4
## [31] vctrs_0.6.5 evaluate_1.0.3 rmarkdown_2.29
## [34] httr_1.4.7 pkgconfig_2.0.3 tools_4.5.0
## [37] htmltools_0.5.8.1 UCSC.utils_1.3.1