TileDBArray 1.2.1
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray
backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
TileDBArray
Creating a TileDBArray
is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.5668693 0.7501519 0.2134371 . -1.318506337 0.707569761
## [2,] -0.9787653 -0.5947098 0.8619550 . -1.224000511 -1.093936202
## [3,] -0.8824716 0.6492722 -0.1467865 . -0.008784831 0.379411506
## [4,] 0.3328920 1.6748393 -0.6845375 . -0.268534759 1.092371570
## [5,] -0.4179091 0.6464936 2.1560817 . -0.082097702 -1.412250855
## ... . . . . . .
## [96,] 1.26804451 1.02126747 1.45797350 . 1.98973031 0.58542266
## [97,] 2.46960382 -0.64363716 -0.89848883 . 0.06967948 0.79184944
## [98,] -2.07132954 0.31172917 -0.05687525 . -0.09038648 -0.82502208
## [99,] 0.61747571 -0.57533886 -2.26104897 . 1.91703056 -0.04693724
## [100,] -0.18802040 -0.41005580 0.08412570 . -0.01407438 0.46800058
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 1.5668693 0.7501519 0.2134371 . -1.318506337 0.707569761
## [2,] -0.9787653 -0.5947098 0.8619550 . -1.224000511 -1.093936202
## [3,] -0.8824716 0.6492722 -0.1467865 . -0.008784831 0.379411506
## [4,] 0.3328920 1.6748393 -0.6845375 . -0.268534759 1.092371570
## [5,] -0.4179091 0.6464936 2.1560817 . -0.082097702 -1.412250855
## ... . . . . . .
## [96,] 1.26804451 1.02126747 1.45797350 . 1.98973031 0.58542266
## [97,] 2.46960382 -0.64363716 -0.89848883 . 0.06967948 0.79184944
## [98,] -2.07132954 0.31172917 -0.05687525 . -0.09038648 -0.82502208
## [99,] 0.61747571 -0.57533886 -2.26104897 . 1.91703056 -0.04693724
## [100,] -0.18802040 -0.41005580 0.08412570 . -0.01407438 0.46800058
This process works also for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 0
## [4,] 0 0 0 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [996,] 0 0 0 . 0.0 0.0
## [997,] 0 0 0 . 0.0 0.0
## [998,] 0 0 0 . 0.0 1.2
## [999,] 0 0 0 . 0.0 0.0
## [1000,] 0 0 0 . 0.0 0.0
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . FALSE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] FALSE FALSE FALSE . FALSE TRUE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 1.5668693 0.7501519 0.2134371 . -1.318506337 0.707569761
## GENE_2 -0.9787653 -0.5947098 0.8619550 . -1.224000511 -1.093936202
## GENE_3 -0.8824716 0.6492722 -0.1467865 . -0.008784831 0.379411506
## GENE_4 0.3328920 1.6748393 -0.6845375 . -0.268534759 1.092371570
## GENE_5 -0.4179091 0.6464936 2.1560817 . -0.082097702 -1.412250855
## ... . . . . . .
## GENE_96 1.26804451 1.02126747 1.45797350 . 1.98973031 0.58542266
## GENE_97 2.46960382 -0.64363716 -0.89848883 . 0.06967948 0.79184944
## GENE_98 -2.07132954 0.31172917 -0.05687525 . -0.09038648 -0.82502208
## GENE_99 0.61747571 -0.57533886 -2.26104897 . 1.91703056 -0.04693724
## GENE_100 -0.18802040 -0.41005580 0.08412570 . -0.01407438 0.46800058
TileDBArray
sTileDBArray
s are simply DelayedArray
objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## GENE_1 GENE_2 GENE_3 GENE_4 GENE_5 GENE_6
## 1.5668693 -0.9787653 -0.8824716 0.3328920 -0.4179091 0.6054562
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix
object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 1.566869277 0.750151913 0.213437070 0.251327561 0.532867279
## GENE_2 -0.978765309 -0.594709772 0.861954971 1.117249146 0.517750180
## GENE_3 -0.882471612 0.649272188 -0.146786544 -0.002772576 1.336458441
## GENE_4 0.332891991 1.674839305 -0.684537461 0.249554432 -0.689168052
## GENE_5 -0.417909067 0.646493642 2.156081699 1.348324603 -1.605948717
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 3.1337386 1.5003038 0.4268741 . -2.63701267 1.41513952
## GENE_2 -1.9575306 -1.1894195 1.7239099 . -2.44800102 -2.18787240
## GENE_3 -1.7649432 1.2985444 -0.2935731 . -0.01756966 0.75882301
## GENE_4 0.6657840 3.3496786 -1.3690749 . -0.53706952 2.18474314
## GENE_5 -0.8358181 1.2929873 4.3121634 . -0.16419540 -2.82450171
## ... . . . . . .
## GENE_96 2.5360890 2.0425349 2.9159470 . 3.97946061 1.17084531
## GENE_97 4.9392076 -1.2872743 -1.7969777 . 0.13935895 1.58369888
## GENE_98 -4.1426591 0.6234583 -0.1137505 . -0.18077296 -1.65004415
## GENE_99 1.2349514 -1.1506777 -4.5220979 . 3.83406112 -0.09387448
## GENE_100 -0.3760408 -0.8201116 0.1682514 . -0.02814877 0.93600116
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6 SAMP_7
## 5.712808 -12.639891 12.623548 -6.372757 -8.748190 14.363428 2.881060
## SAMP_8 SAMP_9 SAMP_10
## 9.841876 -3.997361 16.891927
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 1.0325688
## GENE_2 1.4491838
## GENE_3 0.9435826
## GENE_4 1.6382264
## GENE_5 1.4473230
## ... .
## GENE_96 3.4079898
## GENE_97 0.5748998
## GENE_98 2.6475800
## GENE_99 -0.8384569
## GENE_100 0.5334791
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray()
.
For example, the example below allows us to control the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.88954215 -0.27977045 -0.33700991 . 0.55178509 -1.78147217
## [2,] -0.41382804 0.50938665 0.77263912 . 1.75329960 0.99115895
## [3,] 1.24371589 0.11596520 0.06516864 . 0.35632503 -0.09072073
## [4,] -0.45384171 -0.38525065 -0.39026818 . -0.62569933 0.49756326
## [5,] -0.85770889 0.65777988 -0.02926731 . -1.26404231 -1.25795105
## ... . . . . . .
## [96,] 1.1284492 0.5346054 -1.2747662 . 0.89129388 -0.23489928
## [97,] 1.9232763 1.6159466 0.6413157 . -0.29156094 0.03074512
## [98,] -1.7864194 -2.1214400 -0.7274204 . -0.24777406 0.73042066
## [99,] 0.4116313 -0.9426214 0.2337231 . 1.04524748 -1.04803932
## [100,] 1.1676334 -0.8855340 0.4690667 . 0.97268552 0.02875474
As these arguments cannot be passed during coercion, we instead provide global variables that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.88954215 -0.27977045 -0.33700991 . 0.55178509 -1.78147217
## [2,] -0.41382804 0.50938665 0.77263912 . 1.75329960 0.99115895
## [3,] 1.24371589 0.11596520 0.06516864 . 0.35632503 -0.09072073
## [4,] -0.45384171 -0.38525065 -0.39026818 . -0.62569933 0.49756326
## [5,] -0.85770889 0.65777988 -0.02926731 . -1.26404231 -1.25795105
## ... . . . . . .
## [96,] 1.1284492 0.5346054 -1.2747662 . 0.89129388 -0.23489928
## [97,] 1.9232763 1.6159466 0.6413157 . -0.29156094 0.03074512
## [98,] -1.7864194 -2.1214400 -0.7274204 . -0.24777406 0.73042066
## [99,] 0.4116313 -0.9426214 0.2337231 . 1.04524748 -1.04803932
## [100,] 1.1676334 -0.8855340 0.4690667 . 0.97268552 0.02875474
sessionInfo()
## R version 4.1.0 RC (2021-05-10 r80283)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TileDBArray_1.2.1 DelayedArray_0.18.0 IRanges_2.26.0
## [4] S4Vectors_0.30.0 MatrixGenerics_1.4.0 matrixStats_0.58.0
## [7] BiocGenerics_0.38.0 Matrix_1.3-3 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 bslib_0.2.5.1 compiler_4.1.0
## [4] BiocManager_1.30.15 jquerylib_0.1.4 tools_4.1.0
## [7] digest_0.6.27 bit_4.0.4 jsonlite_1.7.2
## [10] evaluate_0.14 lattice_0.20-44 nanotime_0.3.2
## [13] rlang_0.4.11 RcppCCTZ_0.2.9 yaml_2.2.1
## [16] xfun_0.23 stringr_1.4.0 knitr_1.33
## [19] sass_0.4.0 bit64_4.0.5 grid_4.1.0
## [22] R6_2.5.0 rmarkdown_2.8 bookdown_0.22
## [25] tiledb_0.9.2 magrittr_2.0.1 htmltools_0.5.1.1
## [28] stringi_1.6.2 zoo_1.8-9