Vignette of the pengls package

Stijn Hawinkel

1 Introduction

This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and \(R^2\) as performance measure are implemented.

2 Installation instuctions

The pengls package is available from BioConductor, and can be installed as follows:

library(BiocManager)
install("pengls")

Once installed, it can be loaded and version info printed.

suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.0.0

3 Illustration

3.1 Spatial autocorrelation

We first create a toy dataset with spatial coordinates.

The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.

Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter \(\lambda=0.2\).

## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3

Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.

## pengls model with correlation structure: corGaus 
##  and 29 non-zero coefficients

3.2 Temporal autocorrelation

The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:

The fitting command is similar, this time the \(\lambda\) parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose \(\alpha=0.5\) this time, fitting an elastic net model.

## Fitting naieve model...
## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3

Show the output

## pengls model with correlation structure: corAR1 
##  and 42 non-zero coefficients

3.3 Penalty parameter and cross-validation

The pengls package also provides cross-validation for finding the optimal \(\lambda\) value. If the tuning parameter \(\lambda\) is not supplied, the optimal \(\lambda\) according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :

The function is called similarly to cv.glmnet:

Check the result:

## Cross-validated pengls model with correlation structure: corGaus 
##  and 37 non-zero coefficients.
##  10 fold cross-validation yielded an estimated R2 of 0.7682299 .

By default, the 1 standard error is used to determine the optimal value of \(\lambda\) :

## [1] 0.0268676
## [1] 0.7682299

Extract coefficients and fold IDs.

## [1] 0.01530464 0.00000000 0.00000000 0.00000000 0.96883687 0.00000000
##  38  74  41  49  93  18  26  17  63  25  92  61  82  90  83  31  76  89   1  47 
##   6   9   9   1   8   6   1   6   9   9   8   9   9   3   8   8   3   4   5   7 
##  20  91  11  84  65  88  44  24  40 100  56  50  36  78  12  66  58  30  27  71 
##   6   2   5   8   2  10   7   5   6   3   7   7   6   1   5   1   7   4   7   9 
##  42   5  16  85  96  64  59  79  54  57  21  67  29   3  37   6  60  28  13  46 
##   8   6   7  10  10   2   4   3   5   3   5   1   1   5   1   6   3   7   5   4 
##  10  45   7  32  70  35  15  95  87  97   2  81  53  69  19 
##   6   1   6   5   3   9   6  10  10   3   5   2   9   1   6

By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:

To perform random cross-validation

To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:

## [1] 0.9828069

4 Session info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.28.0 nlme_3.1-153        pengls_1.0.0       
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.36       magrittr_2.0.1   splines_4.1.1    lattice_0.20-45 
##  [5] R6_2.5.1         rlang_0.4.12     fastmap_1.1.0    foreach_1.5.1   
##  [9] highr_0.9        stringr_1.4.0    tools_4.1.1      parallel_4.1.1  
## [13] grid_4.1.1       glmnet_4.1-2     xfun_0.27        jquerylib_0.1.4 
## [17] htmltools_0.5.2  iterators_1.0.13 yaml_2.2.1       survival_3.2-13 
## [21] digest_0.6.28    Matrix_1.3-4     sass_0.4.0       codetools_0.2-18
## [25] shape_1.4.6      evaluate_0.14    rmarkdown_2.11   stringi_1.7.5   
## [29] compiler_4.1.1   bslib_0.3.1      jsonlite_1.7.2