1 Introduction

Scientific computing in Python is well-established. This package takes advantage of new work at RStudio (the reticulate package) that fosters Python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation; this package takes an initial step in that direction.

A key motivation is experimenting with an incremental PCA implementation for very large, out-of-memory data. We have also provided an interface to the sklearn.cluster.KMeans procedure.
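As a point of reference, the following is a minimal sketch, in plain Python, of the scikit-learn procedure the KMeans interface delegates to; the two-blob data here are simulated purely for illustration and are not part of the package.

```python
# Sketch of the underlying sklearn.cluster.KMeans call, on simulated
# data: two well-separated Gaussian blobs in the plane.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # blob near the origin
               rng.normal(5, 0.5, (50, 2))])  # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)   # one 2-d center per cluster
print(sorted(set(km.labels_)))     # cluster assignments for the 100 rows
```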

2 Basic concepts

2.1 Module references

The package includes a list of references to Python modules.

library(BiocSklearn)

2.2 Python documentation

We can acquire Python documentation for the included modules with reticulate’s py_help; note that such output can become stale as the upstream modules evolve.

3 Dimension reduction

3.1 PCA

We obtain a numeric matrix from the iris data, retain the standard R principal components (fullpc) for later concordance checks, and compute the sklearn PCA as follows:

irismat = data.matrix(iris[,1:4])
fullpc = prcomp(irismat)$x # base R PCA projections, for comparison
ppca = skPCA(irismat)
ppca
## SkDecomp instance, method:  PCA
## use getTransformed() to acquire projected input.

This returns an object whose underlying Python instance can be reused through native methods. The numerical transformation is accessed via getTransformed.

tx = getTransformed(ppca)
dim(tx)
## [1] 150   4
head(tx)
##           [,1]       [,2]        [,3]         [,4]
## [1,] -2.684126  0.3193972 -0.02791483 -0.002262437
## [2,] -2.714142 -0.1770012 -0.21046427 -0.099026550
## [3,] -2.888991 -0.1449494  0.01790026 -0.019968390
## [4,] -2.745343 -0.3182990  0.03155937  0.075575817
## [5,] -2.728717  0.3267545  0.09007924  0.061258593
## [6,] -2.280860  0.7413304  0.16867766  0.024200858
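For comparison, the computation that skPCA wraps can be run directly in Python with scikit-learn; load_iris supplies the same 150 x 4 measurement matrix, and the projected values match those shown above up to sign.

```python
# Direct scikit-learn counterpart of the skPCA call above.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irismat = load_iris().data                       # 150 x 4 measurements
tx = PCA(n_components=4).fit_transform(irismat)  # projected coordinates
print(tx.shape)                  # (150, 4)
print(round(abs(tx[0, 0]), 3))   # magnitude of the first projected value
```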

The native sklearn methods can be applied to the output of pyobj.

pyobj(ppca)$fit_transform(irismat)[1:3,]

Concordance with the R computation can be checked:

round(cor(tx, fullpc),3)
##      PC1 PC2 PC3 PC4
## [1,]   1   0   0   0
## [2,]   0  -1   0   0
## [3,]   0   0  -1   0
## [4,]   0   0   0  -1

3.2 Incremental PCA

A computation supporting a priori bounding of memory consumption is available. In this procedure one can also select the number of principal components to compute.

ippca = skIncrPCA(irismat)
ippcab = skIncrPCA(irismat, batch_size=25L) # bound memory with batches of 25 records
round(cor(getTransformed(ippcab), fullpc),3)

3.3 Manual incremental PCA with explicit chunking

This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object; there is no container for the partial results at present. Again, the number of components computed can be specified.

ta = skels$np$take # slicer utility from numpy
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
fullpc[1:5,]
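On the Python side, these two incremental strategies correspond to sklearn's IncrementalPCA: automatic batching via batch_size, and manual partial_fit calls over chunks. A minimal sketch, with chunk boundaries chosen to mirror the R code above:

```python
# Incremental PCA two ways: automatic batching, and manual chunked
# updates as one would use for streamed data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

irismat = load_iris().data
fullpc = PCA(n_components=4).fit_transform(irismat)

# (a) bounded-memory fit via batch_size
tx_auto = IncrementalPCA(n_components=4, batch_size=25).fit_transform(irismat)

# (b) explicit chunking, updating the estimator one block at a time
ipc = IncrementalPCA(n_components=4)
for chunk in (irismat[0:50], irismat[50:100], irismat[100:150]):
    ipc.partial_fit(chunk)
tx_manual = ipc.transform(irismat)

# first components agree with the full PCA up to sign
r = np.corrcoef(tx_manual[:, 0], fullpc[:, 0])[0, 1]
print(tx_auto.shape, tx_manual.shape, round(abs(r), 2))
```

As with the R concordance check, agreement is up to sign, since the orientation of each component is arbitrary.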

4 Interoperation with HDF5 matrix

We have extracted methylation data for the Yoruban subcohort of CEPH from the yriMulti package. Data from chr6 and chr17 are available in an HDF5 matrix in this BiocSklearn package. A reference to the dataset through the h5py File interface is created by H5matref.

THIS CODE IS NOT BASILISK COMPLIANT AND IS BLOCKED. See skPartialPCA_h5 for valid interface.

fn = system.file("ban_6_17/assays.h5", package="BiocSklearn")
ban = H5matref(fn)
ban 

We will explicitly define the numpy matrix.

np = import("numpy", convert=FALSE) # ensure
ban$shape

We’ll treat genes as records and individuals as features.

ban2 = np$matrix(ban)$T

We’ll define three chunks of the data and update the partial PCA contributions in the object st.

st = skPartialPCA_step(ta(ban2, 0:999, 0L))
st = skPartialPCA_step(ta(ban2, 1000:10999, 0L), obj=st)
st = skPartialPCA_step(ta(ban2, 11000:44559, 0L), obj=st)
sss = st$transform(ban2)

Verify against the standard PCA, checking correlation between the projections to the first four PCs.

iii = skPCA(ban2)
dim(getTransformed(iii))
round(cor(sss[,1:4], getTransformed(iii)[,1:4]),3)
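The chunked partial-fit pattern above can be sketched directly in Python. Here a numpy memmap stands in for the HDF5-backed matrix (the vignette itself reads its data through h5py), and the matrix dimensions are simulated rather than those of the methylation data.

```python
# Out-of-core chunked PCA: an on-disk matrix is consumed in blocks,
# so the full data need never be resident in memory at fit time.
import os, tempfile
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(1)
path = os.path.join(tempfile.mkdtemp(), "assays.dat")

# write a simulated 3000-record x 8-feature matrix to disk
m = np.memmap(path, dtype="float64", mode="w+", shape=(3000, 8))
m[:] = rng.normal(size=(3000, 8))
m.flush()

# re-open read-only and update the estimator in three chunks
mm = np.memmap(path, dtype="float64", mode="r", shape=(3000, 8))
ipc = IncrementalPCA(n_components=4)
for start in range(0, 3000, 1000):
    ipc.partial_fit(mm[start:start + 1000])
sss = ipc.transform(mm)
print(sss.shape)  # (3000, 4)
```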

5 Conclusions

We need more applications and profiling.