1 Overview

The “Spaced Words Projection (sWeeP)” is a method for representing biological sequences as relatively small vectors. It uses the spaced-words concept: each sequence is scanned and the spaced words found are converted into indices to build a high-dimensional vector, which is then projected onto a smaller, randomly oriented orthonormal basis. The method is suitable for making high-quality comparisons between sequences, allowing analyses that are not feasible with traditional techniques because of their computational cost. The method is available at sWeeP (PIERRI, 2019). Its main speed gain comes from a constant processing time per input: the response time grows linearly with the number of inputs, whereas in other methods it grows exponentially.
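The pipeline described above (scan with a spaced word, turn each word into an index, count, then project) can be sketched in base R. This is a toy illustration under stated assumptions, not the rSWeeP implementation: the mask `11011`, the DNA alphabet, the helper name `spacedWordCounts`, and the QR-based random basis are all choices made here for the example.

```r
# Toy illustration only; not the rSWeeP implementation.
alphabet <- c("A", "C", "G", "T")
mask <- c(1, 1, 0, 1, 1)   # hypothetical spaced-word mask: 1 = read, 0 = skip

# Count the spaced words of one sequence in a vector of length 4^4 = 256.
spacedWordCounts <- function(sequence, mask, alphabet) {
  chars <- strsplit(sequence, "")[[1]]
  keep <- which(mask == 1)
  k <- length(keep)
  counts <- numeric(length(alphabet)^k)
  nWindows <- max(length(chars) - length(mask) + 1, 0)
  for (start in seq_len(nWindows)) {
    window <- chars[start:(start + length(mask) - 1)]
    digits <- match(window[keep], alphabet) - 1          # 0-based symbol codes
    index <- sum(digits * length(alphabet)^(seq_len(k) - 1)) + 1
    counts[index] <- counts[index] + 1
  }
  counts
}

# Made-up sequences: seqA and seqB differ only in the last base.
sequences <- c(seqA = "ACGTACGTAC", seqB = "ACGTACGTAA", seqC = "TTTTGGGGCC")

# One high-dimensional row per sequence (3 x 256).
highDim <- t(sapply(sequences, spacedWordCounts, mask = mask, alphabet = alphabet))

# Random orthonormal basis (256 x 5) from the QR decomposition of a Gaussian
# matrix; an assumption standing in for whatever orthBase does internally.
set.seed(1)
basis <- qr.Q(qr(matrix(rnorm(256 * 5), nrow = 256, ncol = 5)))

# Low-dimensional representation (3 x 5), one row per sequence.
projected <- highDim %*% basis
```

Each row of `projected` is a short vector standing in for one sequence, so the cost of comparing two sequences depends on the projection size rather than on the sequence length.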

1.1 Functions

The package has two functions: orthBase, which generates an orthonormal matrix of a chosen size, and sWeeP, which applies the sWeeP method.

2 Quick Start

The orthBase function creates a quasi-orthonormal matrix of any desired size. Here it is used to create the matrix onto which the sWeeP method projects the sequences, so it must have 160,000 rows and as many columns as the desired projection size.

library(rSWeeP)
baseMatrix <- orthBase(160000,10)
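To make “quasi-orthonormal” concrete, the sketch below measures how far a matrix is from having orthonormal columns. The helper name `orthonormalityError` is hypothetical, the stand-in matrices are generated with base R rather than with orthBase, and treating the columns as the basis vectors is an assumption.

```r
# Largest absolute deviation of t(B) %*% B from the identity matrix.
# 0 means the columns are exactly orthonormal; small values mean "quasi".
orthonormalityError <- function(B) {
  max(abs(crossprod(B) - diag(ncol(B))))
}

set.seed(42)
gaussian <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

# Exactly orthonormal columns, obtained from a QR decomposition.
exactBasis <- qr.Q(qr(gaussian))

# Only approximately orthonormal: scaled random Gaussian columns.
quasiBasis <- gaussian / sqrt(nrow(gaussian))

errExact <- orthonormalityError(exactBasis)
errQuasi <- orthonormalityError(quasiBasis)
```

Under the same assumption, calling `orthonormalityError(baseMatrix)` on the matrix created above would give a rough idea of how close it is to an orthonormal basis.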

The exdna.fas dataset is a list of three strings that simulate DNA sequences; it is used for demonstration purposes only.

path <- system.file(package = "rSWeeP", "extdata", "exdna.fas")

The sWeeP method is then applied. It returns a matrix that represents the sequences as vectors, so that they can be compared with one another. The result can then be displayed graphically as a phylogenetic tree.

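# Project the sequences; each row of the result represents one sequence
projected <- sWeeP(path, baseMatrix)
# Pairwise Euclidean distances between the projected vectors
distancia <- dist(projected, method = "euclidean")
# Hierarchical clustering (Ward's method), drawn as a dendrogram
tree <- hclust(distancia, method = "ward.D")
plot(tree, hang = -1, cex = 1)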
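The distance and clustering steps can be tried without the package on a small hand-made matrix. The coordinates and row names below are invented for illustration; `cutree` is a base R function that is not part of the original example and is shown only as one way to read groups off the tree.

```r
# Three made-up "projected sequences": seq1 and seq2 are close, seq3 is far away.
toy <- rbind(seq1 = c(0, 0),
             seq2 = c(0.1, 0),
             seq3 = c(5, 5))

# Same steps as in the Quick Start: Euclidean distances, then Ward clustering.
toyDist <- dist(toy, method = "euclidean")
toyTree <- hclust(toyDist, method = "ward.D")

# Cut the tree into two groups to obtain a cluster label per sequence.
groups <- cutree(toyTree, k = 2)
```

Here seq1 and seq2 are merged first and end up in the same group, which is the behaviour expected for closely related sequences in the dendrogram.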

3 Session information

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rSWeeP_1.14.0    BiocStyle_2.30.0
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.2            cli_3.6.1               knitr_1.44             
##  [4] magick_2.8.1            rlang_1.1.1             xfun_0.40              
##  [7] jsonlite_1.8.7          S4Vectors_0.40.0        RCurl_1.98-1.12        
## [10] Biostrings_2.70.0       htmltools_0.5.6.1       pracma_2.4.2           
## [13] sass_0.4.7              stats4_4.3.1            rmarkdown_2.25         
## [16] evaluate_0.22           jquerylib_0.1.4         bitops_1.0-7           
## [19] fastmap_1.1.1           GenomeInfoDb_1.38.0     yaml_2.3.7             
## [22] IRanges_2.36.0          bookdown_0.36           BiocManager_1.30.22    
## [25] compiler_4.3.1          Rcpp_1.0.11             XVector_0.42.0         
## [28] digest_0.6.33           R6_2.5.1                GenomeInfoDbData_1.2.11
## [31] magrittr_2.0.3          bslib_0.5.1             tools_4.3.1            
## [34] zlibbioc_1.48.0         BiocGenerics_0.48.0     cachem_1.0.8

4 References

Appendix