gpuMagic 1.4.0

The gpuMagic package is designed for ultilizing the computing power of modern GPU. The package has the ability to compile R code and implement the code on a GPU kernel. Therefore, it is easy for the user to design their own algorithm in pure R language and run the code in GPU. The package uses openCL as its parallel language so parallelization using the CPU is also supported.

To get started, it is always a good idea to check your GPU avalability.

```
getDeviceList()
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL
```

The devices listed above show you the current avalable devices, if you cannot see the device, you should install the openCL driver obtained from the corresponding vender. The default device is the device 1, if you have multiple devices, you can query and change your device by

```
getDeviceInfo(1)
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL
setDevice(1)
#> No device has been found!
#> NULL
```

The package also has a GPU resources manager. You can check your memory usage at any time by

```
gpuMagic.getMemUsage()
#> Device memory usage:
#> --Used: (%) --Total:
```

After the proper device has been selected, you are ready to write your GPU code.

###Warning If you are using a Nvidia graphics card, it is a good practice to change your Timeout Detection and Recovery(TDR) setting as the driver will reset the graphics card if the function takes too long to run the code. The default setting is 2 seconds, which is too short for most operations. If you are using another device such as an Intel graphic card, you are safe to continue.

The setting can be found in NVIDIA Nsight. Change it to a larger number such as 20 seconds will make most programs work.

The GPU code should be written as an R function and should be compatible with the **sapply** function. Here is an example of matrix multiplication

```
#C=A%*%B
#Each time the function will compute a column of C
matMul<-function(ind,A,B){
C=A%*%B[,ind]
return(C)
}
```

You can test your function by using the **sapply** function

```
n=10
m=50
k=100
A=matrix(runif(n*m),n,m)
B=matrix(runif(m*k),m,k)
C_sapply=sapply(1:k,matMul,A,B)
#The internal matrix operation function in R
C_internel=A%*%B
#Compute error
max(abs(C_sapply-C_internel))
#> [1] 0
```

As you see here, the result from sapply is consistent with the result from the R internal matrix multiplication function. To run the above function in GPU, it can be simply achieved by changing **sapply** to **gpuSapply**

```
C_gpu=gpuSapply(1:k,matMul,A,B)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#Compute error
max(abs(C_gpu-C_internel))
#> [1] 0
```

That is all you need to do to run the GPU code! We can try a larger sample size to see the speed differences between GPU and R code.

```
n=1000
m=1000
k=1000
A=matrix(runif(n*m),n,m)
B=matrix(runif(n*m),m,k)
system.time(
gpuSapply(1:k,matMul,A,B)
)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 1.681 0.000 1.680
system.time(
A%*%B
)
#> user system elapsed
#> 0.751 0.000 0.752
```

In my test environment(GF 940m), the above code takes about 0.1 secs and CPU(i5-6300U) takes 0.6 secs to finish, a 6 times speed up in the matrix operation!

Here we discuss some common questions that you will have when using the package. Some are due to the limitations of openCL language and some are due to the lazyness of the package author.

##Important notes
+ All objects in openCL are either a scalar or a matrix, vector is not supported
+ `NA`

and `NULL`

values are not supported, I did not test them yet but it should be OK to pass them to the device, it just may not work as you expect. This feature will be added in the near future.
+ Do not allocate(create) variables in for-loops or if-statements, it may cause an unexpected error. Please allocate them before the loops or if-statements.
+ Be careful when you do the self assignments, the code like `A[1:2]=A[2:1]`

does not work, since the opencl does not have a temporary space to store the intermediate results and therefore the assignments will be excuted in sequence, so the above code is equivalent to `A[1]=A[2];A[2]=A[1]`

. The value in `A[2]`

will be incorrect.
+ For saving the memory space, the function arguments are shared by all threads. Therefore changing the value of the function arguments is depreciated unless the change made by one thread will not be read by the other threads.
+If you find any bugs in the package that prevent you from running the openCL code, a temporary solution would be using the scalar opration, because most bugs are introduced by the poor design of the matrix operation.
## Dynamic allocation? Nope!
In R language, it is fairly easy to create a matrix without knowing its size in advance, for example,

```
n=sample(1:10,1)
A=matrix(0,n,n)
```

will create an n-by-n matrix A. The size of A is undetermined until the first line of the code is evaluated. The ability to create a matrix without knowing its size before running the code is called dynamic allocation. Unfortunately, due to the language limitation, openCL does not support this feature, which means the size of every matrix in the openCL code should be known before its excution. Therefore, it is impossible to create a matrix dynamically. To solve this problem, the package provides two alternative ways to make a dynamic-like matrix.

The `gpuSapply`

function has an argument `.macroParms`

to specify the macro variable. Therefore, the value of the macro will be plugged into the function and is known for the compiler. for example

```
dynamic_example1<-function(ind,A,n){
B=matrix(0,n)
C=A+B
return(C)
}
n=10
A=matrix(runif(n),n)
#The code below wouldn't work,
#the value of n is unknown in the compilation stage
#res_device=gpuSapply(1,dynamic_example1,A,n)
#works, the value of n is knwown as a macro in the compilation stage
res_device=gpuSapply(1,dynamic_example1,A,n,.macroParms = c("n"))
#> No device has been found! Auto dispatch the gpuSapply function to sapply
```

Please note that when you specify a variable as a macro, you cannot change the value of it inside the function since it is fixed

If you do need a dynamic feature in your code, you can use the reference function to create it. The reference function works like a macro. For example

```
#Perform the column sum of the first n elements in each column of A
dynamic_example2<-function(ind,A,n){
#Dynamically subsetting the first n elements in the ind column of the matrix A
#Almost equivalent to A_dyn=A[1:n,ind]
#The size of A_dyn is unknown in the compilation stage since the value of n is unknown
A_dyn=subRef(A,1:n,ind)
C=sum(A_dyn)
return(C)
}
n=10
M=100
A=matrix(runif(M),M,M)
res_device=gpuSapply(1:M,dynamic_example2,A,n)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
res_cpu=sapply(1:M,dynamic_example2,A,n)
#check error
range(res_device-res_cpu)
#> [1] 0 0
```

With the reference function, you can allocate a large matrix and then dynamically subset it to create the matrix with the size you need. However, you need to be careful when you do the operations on the reference object. In the example above, any changes to the reference object `A_dyn`

will cause the change in its target object `A`

. Please refer to the function list to see the available reference functions.

##Dynamic return
Since dynamic allocation is not possible, you may wonder if returning an object with dynamic size is possible. The answer is yes but it is depreciated. If you write multiple `return`

functions in your code. The package will first compute the maximum length of your return variables, and then allocate the memory according to the maximum length. If the length of one return variable cannot be correctly infered, you will get a warning message, and the length of that variable will be ignored. Therefore, in order to correctly return a variable of undetermined length, you should create a matrix which size is always larger than the return variable, then you return that matrix in the end. Therefore the compiler will always have enough space for your return variable( but with redundant space). For example:

```
dynamic_return<-function(ind,A,n){
#Dynamically subsetting the first n elements in the ind column of the matrix A
#equivalent to A_dyn=A[1:n,ind]
A_dyn=subRef(A,1:n,ind)
#return A_dyn, the size is unknown
return(A_dyn)
#The largest length of the variable A_dyn
tmp=matrix(0,nrow(A))
#tell the compiler the return size should be at least the same as the variable tmp
return(tmp)
}
n=10
M=20
A=matrix(runif(M*M),M,M)
#Expect a warning message: Undetermined return size has been found
suppressWarnings({
res_device=gpuSapply(1:M,dynamic_return,A,n)
})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#retrieve the result (The first n rows)
res_device=res_device[1:n,]
#check error
range(res_device-A[1:n,])
#> [1] 0 0
```

In the future, the package will provide a more concise way to specify the return size.

##Oprations between two matrice with different size? Not implemented.
The language used in the package is only a subset of the R language, it fully supports scalar to scalar, matrix to matrix, scalar to matrix, and matrix to scalar opration. However, for the matrix to matrix operations, they must have the same size, otherwise an error will be given(eg. matrix + vector is not allowed). An alternative solution for the matrix to vector operations is using the `sweep`

function, please refer to the R documentation to see the usage.

##Error checking
For performance reasons, the package only performs the error checking in the compilation stage. Therefore, when an error occurs on GPU, the code will continue to run silently, and crash R at some point without any hint. It is always a good idea to run a sample code with the`sapply`

function to check for possible errors before you launch the rocker with your precious data.

##Sequence variable
For the sequence variable, you can create it without limitation. As you see in the `dynamic_example2`

function, we have implicitly created a sequence variable `1:n`

, whose size is unknown. You can use the function `seq`

or `:`

to create the sequence. the only exception is that you cannot change the value of the sequence variable, because the variable is in a compact format and does not own a memory. If you do want to create a sequence variable in memory, an alternative way is to create the sequence first and assign it to a new variable, then the compiler will allocate the memory for the new variable.

##Calling another user-defined R function inside the R function The current version does not support calling user-defined R functions inside your main function, this feature is under development and will come in the near future.

#More examples

Here we provide more examples to show the usage of the package. Please note that not all the functions will be efficient when running on the GPU, it is possible that the CPU code can be faster than the GPU. Depending on the job type, you should decide which device you want to run on by calling `setDevice()`

.

##Finding the indices of largest k values in each column of A

```
largestK<-function(ind,A,k){
#get a column of A
A_col=subRef(A,,ind)
#the largest k values and indices of A_col
largest_value=matrix(0,k)
largest_ind=matrix(0,k)
#loop over A_col to find the largest k values
for(i in 1:length(A_col)){
for(j in 1:k){
if(A_col[i]>largest_value[j]){
if(j!=k){
#Backward assignment to avoid self assignment problem
#(see section: Important note)
ind_src=(k-1):j
largest_value[ind_src+1]=largest_value[ind_src]
largest_ind[ind_src+1]=largest_ind[ind_src]
}
largest_value[j]=A_col[i]
largest_ind[j]=i
break
}
}
}
return_nocpy(largest_ind)
}
N=1000
M=1000
k=10
A=matrix(runif(N*M),N,M)
#Warmup
warm_res=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(1:M,largestK,A=A,k=k)
#count the time
system.time({res_device=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 1.179 0.012 1.199
system.time({res_cpu=sapply(1:M,largestK,A=A,k=k)})
#> user system elapsed
#> 1.100 0.000 1.111
#Check the error
range(res_device-res_cpu)
#> [1] 0 0
```

##Brute-force method to find the solution of the linear regression with L1 loss function

```
computeLoss<-function(ind,x,y,parms){
#Find the parameters for the thread
parm=parms[ind,]
#Compute y hat, use no-copy transpose
y_hat=x%*%t_nocpy(parm)
#absolute loss value(L1 loss)
loss=sum(abs(y-y_hat))
return(loss)
}
#Sample size
n=1000
#Number of parameters
p=2
beta=c(2,3)
#Generate data
x=matrix(runif(n*p),n,p)
e=runif(n)
y=x%*%beta+e
#Brute-force search
#The seaching range and precision
search_range=seq(0,10,0.1)
parms=expand.grid(search_range,search_range)
#Warmup
warm_res=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)
#count the time
system.time({res_device=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 1.289 0.000 1.289
system.time({res_cpu=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#> user system elapsed
#> 1.307 0.000 1.307
#print out the parameters which minimize the loss function
parms[which.min(res_device),]
#> Var1 Var2
#> 3560 2.4 3.5
parms[which.min(res_cpu),]
#> Var1 Var2
#> 3560 2.4 3.5
```

#Supported functions
The package currently supports a limited number of functions, the usage of them are similar but not promised to be the same as the R functions. For example, the **drop** argument in ** [** function is not supported yet. Therefore, in the current stage, it is better to only call the simplest form of the function. The current available functions are:

Function | Note |
---|---|

nrow | the value can be used as a dimensional parameter to create a matrix |

ncol | the value can be used as a dimensional parameter to create a matrix |

length | |

matrix | |

[ | does not workdrop |

abs | |

floor | |

ceiling | |

t | |

sum | Supports one argument only |

rowSums | Supports one argument only |

colSums | Supports one argument only |

rowMeans | Supports one argument only |

colMeans | Supports one argument only |

sort | Supports one argument only, ascending sort |

sweep | |

+ | |

- | |

* | |

/ | |

^ | |

> | |

>= | |

< | |

<= | |

== | |

%*% | |

return |

Explicit declaration function | note |
---|---|

subRef | Create a reference object to a subset of a matrix, the usage is same as function[ |

seq | Support arguments onlyfrom, to, by,length.out |

: | |

Matrix | Declare a matrix(Depreciated) |

Scalar | Declare a scalar(Depreciated) |

Besides traditional R function, the package also provides a few no-copy methods. These methods do not allocate memory for the result. Instead, they return a referent(AKA pointer) pointing to the target matrix. Therefore, any changes made in the reference object will cause the changes in the target matrix. This feature is for saving memory usage, but it is could be very useful when you need “dynamic” allocation. You can first allocate a matrix with fixed dimension and then create a reference object pointing to that memory space, the size of the reference object can be undetermined in the compilation stage.

Function | Note |
---|---|

subRef | No-copy subsetting, the usage is same as `[` |

t_nocpy | No-copy transpose, it return a reference of the transpose of a matrix |

return_nocpy | No-copy return, return the variable without an additional copy, it does not support reference object |

#R version and Session info Here are the platform information used to compile this vignette.

```
version
#> _
#> platform x86_64-pc-linux-gnu
#> arch x86_64
#> os linux-gnu
#> system x86_64, linux-gnu
#> status
#> major 4
#> minor 0.0
#> year 2020
#> month 04
#> day 24
#> svn rev 78286
#> language R
#> version.string R version 4.0.0 (2020-04-24)
#> nickname Arbor Day
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gpuMagic_1.4.0 BiocStyle_2.16.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5
#> [4] BiocGenerics_0.34.0 MASS_7.3-51.6 lattice_0.20-41
#> [7] Deriv_4.0 rlang_0.4.5 stringr_1.4.0
#> [10] tools_4.0.0 parallel_4.0.0 grid_4.0.0
#> [13] xfun_0.13 htmltools_0.4.0 DescTools_0.99.34
#> [16] yaml_2.2.1 digest_0.6.25 bookdown_0.18
#> [19] Matrix_1.2-18 pryr_0.1.4 BiocManager_1.30.10
#> [22] codetools_0.2-16 evaluate_0.14 rmarkdown_2.1
#> [25] stringi_1.4.6 compiler_4.0.0 boot_1.3-25
#> [28] expm_0.999-4 mvtnorm_1.1-0
```