Data (pre-)processing and data analysis of Metabolomics and other omics datasets using struct and structToolbox, including univariate/multivariate statistics and machine learning approaches.

structToolbox 1.12.3

The ‘structToolbox’ includes an extensive set of data (pre-)processing and analysis tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. The methods and tools have been implemented using class-based templates available via the `struct`

(Statistics in R Using Class-based Templates) package. The aim of this vignette is to introduce the reader to basic and more advanced structToolbox-based operations and implementations, such as the use of `struct`

objects, getting/setting methods/parameters, and building workflows for the analysis of mass spectrometry (MS) and nuclear magnetic resonance (NMR)-based Metabolomics and proteomics datasets. The workflows demonstrated here include a wide range of methods and tools including pre-processing such as filtering, normalisation and scaling, followed by univariate and/or multivariate statistics, and machine learning approaches.

The latest version of `structToolbox`

compatible with your current R version can be installed using `BiocManager`

.

```
# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# install structToolbox and dependencies
BiocManager::install("structToolbox")
```

A number of additional packages are needed for this vignette.

```
## install additional bioc packages for vignette if needed
#BiocManager::install(c('pmp', 'ropls', 'BiocFileCache'))
## install additional CRAN packages if needed
#install.packages(c('cowplot', 'openxlsx'))
suppressPackageStartupMessages({
# Bioconductor packages
library(structToolbox)
library(pmp)
library(ropls)
library(BiocFileCache)
# CRAN libraries
library(ggplot2)
library(gridExtra)
library(cowplot)
library(openxlsx)
})
# use the BiocFileCache
bfc <- BiocFileCache(ask = FALSE)
```

`struct`

objects, including models, model sequences, model charts and ontology.PCA (Principal Component Analysis) and PLS (Partial Least Squares) are commonly applied methods for exploring and analysing multivariate datasets. Here we use these two statistical methods to demonstrate the different types of `struct`

(STatistics in R Using Class Templates) objects that are available as part of the `structToolbox`

and how these objects (i.e. class templates) can be used to conduct unsupervised and supervised multivariate statistical analysis.

For demonstration purposes we will use the “Iris” dataset. This famous (Fisher’s or Anderson’s) dataset contains measurements of sepal length and width and petal length and width, in centimeters, for 50 flowers from each of 3 class of Iris. The class are Iris setosa, versicolor, and virginica. See here (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html) for more information.

Note: this vignette is also compatible with the Direct infusion mass spectrometry metabolomics “benchmark” dataset described in Kirwan et al., *Sci Data* *1*, 140012 (2014) (https://doi.org/10.1038/sdata.2014.12).

Both datasets are available as part of the structToolbox package and already prepared as a `DatasetExperiment`

object.

```
## Iris dataset (comment if using MTBLS79 benchmark data)
D = iris_DatasetExperiment()
D$sample_meta$class = D$sample_meta$Species
## MTBLS (comment if using Iris data)
# D = MTBLS79_DatasetExperiment(filtered=TRUE)
# M = pqn_norm(qc_label='QC',factor_name='sample_type') +
# knn_impute(neighbours=5) +
# glog_transform(qc_label='QC',factor_name='sample_type') +
# filter_smeta(mode='exclude',levels='QC',factor_name='sample_type')
# M = model_apply(M,D)
# D = predicted(M)
# show info
D
```

```
## A "DatasetExperiment" object
## ----------------------------
## name: Fisher's Iris dataset
## description: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of
## the variables sepal length and width and petal length and width,
## respectively, for 50 flowers from each of 3 species of iris. The species are
## Iris setosa, versicolor, and virginica.
## data: 150 rows x 4 columns
## sample_meta: 150 rows x 2 columns
## variable_meta: 4 rows x 1 columns
```

The `DatasetExperiment`

object is an extension of the `SummarizedExperiment`

class used by the Bioconductor community. It contains three main parts:

`data`

A data frame containing the measured data for each sample.`sample_meta`

A data frame of additional information related to the samples e.g. group labels.`variable_meta`

A data frame of additional information related to the variables (features) e.g. annotations

Like all `struct`

objects it also contains `name`

and `description`

fields (called “slots” is R language).

A key difference between `DatasetExperiment`

and `SummarizedExperiment`

objects is that the data is transposed. i.e. for `DatasetExperiment`

objects the samples are in rows and the features are in columns, while the opposite is true for `SummarizedExperiment`

objects.

All slots are accessible using dollar notation.

```
# show some data
head(D$data[,1:4])
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
```

`struct`

model objectsBefore we can apply e.g. PCA we first need to create a PCA object. This object contains all the inputs, outputs and methods needed to apply PCA. We can set parameters such as the number of components when the PCA model is created, but we can also use dollar notation to change/view it later.

```
P = PCA(number_components=15)
P$number_components=5
P$number_components
```

`## [1] 5`

The inputs for a model can be listed using `param_ids(object)`

:

`param_ids(P)`

`## [1] "number_components"`

Or a summary of the object can be printed to the console:

`P`

```
## A "PCA" object
## --------------
## name: Principal Component Analysis (PCA)
## description: PCA is a multivariate data reduction technique. It summarises the data in a smaller number of
## Principal Components that maximise variance.
## input params: number_components
## outputs: scores, loadings, eigenvalues, ssx, correlation, that
## predicted: that
## seq_in: data
```

Unless you have good reason not to, it is usually sensible to mean centre the columns of the data before PCA. Using the `STRUCT`

framework we can create a model sequence that will mean centre and then apply PCA to the mean centred data.

`M = mean_centre() + PCA(number_components = 4)`

In `structToolbox`

mean centring and PCA are both model objects, and joining them using “+” creates a model_sequence object. In a model_sequence the outputs of the first object (mean centring) are automatically passed to the inputs of the second object (PCA), which allows you chain together modelling steps in order to build a workflow.

The objects in the model_sequence can be accessed by indexing, and we can combine this with dollar notation. For example, the PCA object is the second object in our sequence and we can access the number of components as follows:

`M[2]$number_components`

`## [1] 4`

Model and model_sequence objects need to be trained using data in the form of a `DatasetExperiment`

object. For example, the PCA model sequence we created (`M`

) can be trained using the iris `DatasetExperiment`

object (‘D’).

`M = model_train(M,D)`

This model sequence has now mean centred the original data and calculated the PCA scores and loadings.

Model objects can be used to generate predictions for test datasets. For the PCA model sequence this involves mean centring the test data using the mean of training data, and the projecting the centred test data onto the PCA model using the loadings. The outputs are all stored in the model sequence and can be accessed using dollar notation. For this example we will just use the training data again (sometimes called autoprediction), which for PCA allows us to explore the training data in more detail.

`M = model_predict(M,D)`

Sometimes models don’t make use the training/test approach e.g. univariate statsitics, filtering etc. For these models the `model_apply`

method can be used instead. For models that do provide training/test methods, `model_apply`

applies autoprediction by default i.e. it is a short-cut for applying `model_train`

and `model_predict`

to the same data.

`M = model_apply(M,D)`

The available outputs for an object can be listed and accessed like input params, using dollar notation:

`output_ids(M[2])`

```
## [1] "scores" "loadings" "eigenvalues" "ssx" "correlation"
## [6] "that"
```

`M[2]$scores`

```
## A "DatasetExperiment" object
## ----------------------------
## name:
## description:
## data: 150 rows x 4 columns
## sample_meta: 150 rows x 2 columns
## variable_meta: 4 rows x 1 columns
```

The `struct`

framework includes chart objects. Charts associated with a model object can be listed.

`chart_names(M[2])`

```
## [1] "pca_biplot" "pca_correlation_plot" "pca_dstat_plot"
## [4] "pca_loadings_plot" "pca_scores_plot" "pca_scree_plot"
```

Like model objects, chart objects need to be created before they can be used. Here we will plot the PCA scores plot for our mean centred PCA model.

```
C = pca_scores_plot(factor_name='class') # colour by class
chart_plot(C,M[2])
```

Note that indexing the PCA model is required because the `pca_scores_plot`

object requires a PCA object as input, not a model_sequence.

If we make changes to the input parameters of the chart, `chart_plot`

must be called again to see the effects.

```
# add petal width to meta data of pca scores
M[2]$scores$sample_meta$example=D$data[,1]
# update plot
C$factor_name='example'
chart_plot(C,M[2])
```

```
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
```