BiocNeighbors 1.20.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9170 2187 3043 4787 861 2532 5533 2606 9158 4469
## [2,] 3827 4027 5679 2543 1496 8751 5169 2427 4364 5374
## [3,] 2490 6220 491 782 1402 8024 5193 241 4212 3119
## [4,] 7890 3473 6915 8848 6597 7120 4462 6592 427 2074
## [5,] 4081 4380 8101 928 1315 5265 9589 3443 1797 6275
## [6,] 8776 8308 4594 866 8936 9907 2138 3964 2636 1667
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8146991 0.8668554 0.8736681 0.8850469 0.9021078 0.9033343 0.9171228
## [2,] 0.8051909 0.8937845 0.8978544 0.9111407 0.9298306 0.9401582 0.9517565
## [3,] 0.8022693 0.8939355 0.9111685 0.9647251 0.9694588 0.9939378 0.9946436
## [4,] 0.6883357 0.9274211 0.9429976 0.9458513 0.9663146 0.9871839 0.9910587
## [5,] 0.9946769 1.0624418 1.0891799 1.1080674 1.1111146 1.1336514 1.1581675
## [6,] 0.7206981 0.8502777 0.8793322 0.9103470 0.9132983 0.9172049 0.9481169
## [,8] [,9] [,10]
## [1,] 0.9353919 0.9707053 0.9714199
## [2,] 0.9609740 0.9773840 0.9798102
## [3,] 1.0049042 1.0062886 1.0068802
## [4,] 1.0054924 1.0091718 1.0177615
## [5,] 1.1589582 1.1643794 1.1648592
## [6,] 0.9489189 0.9538189 0.9592605
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 862 6302 1223 4677 5413
## [2,] 5172 732 8591 7410 3946
## [3,] 4982 4212 2830 3798 2292
## [4,] 4808 4269 3125 5776 8491
## [5,] 5411 8148 462 7649 1857
## [6,] 7683 9486 50 5077 1361
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9219522 1.0506451 1.0899365 1.1152922 1.1193435
## [2,] 0.9985842 1.0723281 1.0783435 1.0823529 1.0984902
## [3,] 0.9523321 0.9549680 0.9740916 0.9771364 0.9997572
## [4,] 0.7610670 0.9169366 0.9436059 0.9979706 1.0366374
## [5,] 0.9318078 0.9839750 0.9871143 1.0033323 1.0038940
## [6,] 0.8195763 0.9064996 0.9649357 0.9751717 0.9966074
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmp3mVRlN/file11b4df45400394.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.3.2 Patched (2023-11-13 r85521)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.2 knitr_1.45 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.8 S4Vectors_0.40.2 htmltools_0.5.7
## [7] stats4_4.3.2 sass_0.4.8 rmarkdown_2.25
## [10] grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.37 BiocManager_1.30.22 compiler_4.3.2
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.36.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.3.2 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.3.2 BiocGenerics_0.48.1 cachem_1.0.8