BiocNeighbors 1.0.0

The *BiocNeighbors* package provides an implementation of the k-means for k-nearest neighbors (KMKNN) algorithm, as described by Wang (2012).
For a dataset with \(N\) points, the pre-training is done as follows:

- Apply k-means clustering to all points, partitioning the data into \(\sqrt{N}\) clusters.
- Compute the distance from each data point to its cluster center.
- Store the cluster identities and distances.

For each query point, identification of the nearest neighbors is done as follows:

- Start with a threshold distance \(d\) to the current kth-nearest neighbor (this can be initialized using distances to an arbitrary set of points).
- Compute the distance from the query to each cluster center.
- For each cluster, use the triangle inequality on the query-center distance, the center-point distances and \(d\) to rule out points; exact query-point distances are only computed for points that cannot be ruled out.
- Update \(d\) with the distance to the new kth-nearest neighbor and repeat for the next cluster.

The pre-clustering arranges the points in a manner that effectively reduces the search space, even in high-dimensional data.
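To make this concrete, here is a minimal sketch of a KMKNN-style search for a single query point, written in base R. This is only an illustration of the scheme described above, assuming Euclidean distances; `kmknn_sketch()` is a hypothetical helper, not the package's actual implementation.

```
# Minimal sketch of a KMKNN-style search for one query point.
# Illustration only; not the package's internal implementation.
kmknn_sketch <- function(data, query, k) {
    N <- nrow(data)
    km <- kmeans(data, centers=ceiling(sqrt(N)))

    # Pre-training: distance from each point to its assigned cluster center.
    pt2center <- sqrt(rowSums((data - km$centers[km$cluster, , drop=FALSE])^2))

    # Distance from the query to each cluster center.
    q2center <- sqrt(rowSums(sweep(km$centers, 2, query)^2))

    d <- Inf # threshold distance to the current kth-nearest neighbor
    hits <- integer(0)
    dists <- numeric(0)

    for (cl in order(q2center)) { # search the closest clusters first
        in.cl <- which(km$cluster == cl)

        # Triangle inequality: dist(query, point) >= q2center - pt2center,
        # so points whose lower bound exceeds 'd' cannot enter the top 'k'.
        keep <- in.cl[q2center[cl] - pt2center[in.cl] <= d]
        if (length(keep) == 0L) next

        cur <- sqrt(colSums((t(data[keep, , drop=FALSE]) - query)^2))
        hits <- c(hits, keep)
        dists <- c(dists, cur)

        o <- order(dists)[seq_len(min(k, length(dists)))]
        hits <- hits[o]
        dists <- dists[o]
        if (length(dists) == k) d <- dists[k] # tighten the threshold
    }
    list(index=hits, distance=dists)
}
```

Searching the closest clusters first tightens \(d\) quickly, so most points in distant clusters are discarded without computing any exact distances.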
Note that, while `kmeans` itself is random, the k-nearest neighbors result is fully deterministic (except in the presence of ties; see `?findKNN` for details).

The algorithm is implemented in a combination of R and C++, derived from code in *cydar* (Lun, Richard, and Marioni 2017).
We observe 2- to 5-fold speed-ups in 20- to 50-dimensional data, compared to KD-trees in *FNN* and *RANN* (see https://github.com/LTLA/OkNN2018 for timings).
This is consistent with results from Wang (2012).

The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.

```
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
```

The `findKNN()` method expects a numeric matrix as input, with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting `BNPARAM=KmknnParam()` (which is also the default, so this is not strictly necessary here).

```
library(BiocNeighbors)
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8348 2079 4736 3502 9414 572 2442 1267 7704 836
## [2,] 2337 4921 8189 6943 1764 4972 4591 2667 7761 3753
## [3,] 5538 7759 8677 666 9359 6286 2244 4550 5 5858
## [4,] 233 6731 2933 2571 5413 9063 993 3348 9286 3006
## [5,] 8482 3114 8329 7454 5157 3850 7096 7214 1537 6223
## [6,] 6759 4456 986 5793 9699 1628 3969 6168 2692 2962
```

```
head(fout$distance)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9389646 1.0246597 1.0809231 1.0848316 1.0854004 1.1159382 1.127869
## [2,] 0.9756834 1.0439384 1.0904817 1.0982062 1.1187872 1.1229278 1.127581
## [3,] 0.9672927 0.9799397 1.0203636 1.0643326 1.0750760 1.0783592 1.081310
## [4,] 0.9633158 1.0318064 1.0653143 1.0826126 1.1119344 1.1153617 1.117624
## [5,] 0.8818073 0.8873177 0.9171854 0.9613600 0.9915028 0.9915082 1.013779
## [6,] 0.8738909 0.9602641 0.9647539 0.9789634 0.9963525 1.0135661 1.016573
## [,8] [,9] [,10]
## [1,] 1.129126 1.156162 1.157357
## [2,] 1.139209 1.140677 1.155761
## [3,] 1.105882 1.111450 1.122835
## [4,] 1.131882 1.132397 1.135052
## [5,] 1.016158 1.021197 1.022092
## [6,] 1.034331 1.046464 1.046562
```

Each row of the `index` matrix corresponds to a point in `data` and contains the row indices in `data` that are its nearest neighbors.
For example, the 3rd point in `data` has the following nearest neighbors:

```
fout$index[3,]
```

```
## [1] 5538 7759 8677 666 9359 6286 2244 4550 5 5858
```

… with the following distances to those neighbors:

```
fout$distance[3,]
```

```
## [1] 0.9672927 0.9799397 1.0203636 1.0643326 1.0750760 1.0783592 1.0813100
## [8] 1.1058818 1.1114497 1.1228352
```

Note that the reported neighbors are sorted by distance.
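We can check this directly on the output above:

```
# Distances within each row should be non-decreasing.
all(diff(fout$distance[3,]) >= 0)
```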

Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small dataset:

```
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
```

We then use the `queryKNN()` function to identify the 5 nearest neighbors in `data` for each point in `query`.

```
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1595 2232 7850 3846 5146
## [2,] 1868 776 971 5400 901
## [3,] 4493 8490 6240 4826 9763
## [4,] 9987 9241 5364 7506 9294
## [5,] 1305 878 755 3071 9608
## [6,] 8043 3931 8989 1913 3324
```

```
head(qout$distance)
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9267549 1.0948798 1.1076541 1.1148193 1.1215079
## [2,] 0.8081274 0.8638353 0.8764378 0.8878907 0.8881987
## [3,] 0.8871820 0.9110784 0.9861469 0.9915133 1.0194031
## [4,] 0.9987595 1.0064124 1.0105680 1.0153516 1.0988825
## [5,] 0.8459096 0.8743964 0.9751048 0.9978839 1.0123009
## [6,] 0.9226638 1.0071283 1.0274564 1.0449529 1.0450088
```

Each row of the `index` matrix contains the row indices in `data` that are the nearest neighbors of a point in `query`.
For example, the 3rd point in `query` has the following nearest neighbors in `data`:

```
qout$index[3,]
```

```
## [1] 4493 8490 6240 4826 9763
```

… with the following distances to those neighbors:

```
qout$distance[3,]
```

```
## [1] 0.8871820 0.9110784 0.9861469 0.9915133 1.0194031
```

Again, the reported neighbors are sorted by distance.

Users can perform the search for a subset of query points using the `subset=` argument.
This yields the same result as, but is more efficient than, performing the search for all points and subsetting the output.

```
findKNN(data, k=5, subset=3:5)
```

```
## $index
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5538 7759 8677 666 9359
## [2,] 233 6731 2933 2571 5413
## [3,] 8482 3114 8329 7454 5157
##
## $distance
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9672927 0.9799397 1.0203636 1.064333 1.0750760
## [2,] 0.9633158 1.0318064 1.0653143 1.082613 1.1119344
## [3,] 0.8818073 0.8873177 0.9171854 0.961360 0.9915028
```

If only the indices are of interest, users can set `get.distance=FALSE` to avoid returning the matrix of distances.
This will save some time and memory.

```
names(findKNN(data, k=2, get.distance=FALSE))
```

```
## [1] "index"
```

It is also simple to speed up functions by parallelizing the calculations with the *BiocParallel* framework.

```
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
```

For multiple queries to a constant `data`, the pre-clustering can be performed in a separate step with `buildNNIndex()`.
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering (the algorithm type is automatically determined when `BNINDEX` is specified, so there is no need to also specify `BNPARAM` in the later functions).

```
pre <- buildNNIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
```

Advanced users may also be interested in the `raw.index=` argument, which returns indices directly to the precomputed object rather than to `data`.
This may be useful inside package functions where it may be more convenient to work on a common precomputed object.
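For example, a sketch of how this might be used with the `pre` object built above (the mapping of raw indices back to rows of `data` depends on the internal ordering of the precomputed object, so this is illustrative only):

```
# Raw indices refer to the reordered matrix stored inside 'pre',
# not to rows of the original 'data'.
raw <- findKNN(BNINDEX=pre, k=5, raw.index=TRUE)
head(raw$index)
```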

```
sessionInfo()
```

```
## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.0.0 BiocParallel_1.16.0 knitr_1.20
## [4] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.19 bookdown_0.7 digest_0.6.18
## [4] rprojroot_1.3-2 backports_1.1.2 stats4_3.5.1
## [7] magrittr_1.5 evaluate_0.12 stringi_1.2.4
## [10] S4Vectors_0.20.0 rmarkdown_1.10 tools_3.5.1
## [13] stringr_1.3.1 parallel_3.5.1 xfun_0.4
## [16] yaml_2.2.0 compiler_3.5.1 BiocGenerics_0.28.0
## [19] BiocManager_1.30.3 htmltools_0.3.6
```

Lun, A. T. L., A. C. Richard, and J. C. Marioni. 2017. “Testing for differential abundance in mass cytometry data.” *Nat. Methods* 14 (7):707–9.

Wang, X. 2012. “A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.” *Proc Int Jt Conf Neural Netw* 43 (6):2351–8.