
Preamble

If you are following our online course, the following vignette will be useful as a complementary learning tool. This vignette also covers the essential use cases of various methods in this package for the general mixOmics user. The following methods will be covered:

As outlined in 1.3, this is not an exhaustive list of all the methods found within mixOmics. More information can be found on our website, and you can ask questions via our Discourse forum.


Figure 1: Different types of analyses with mixOmics (Rohart, Gautier, Singh, and Lê Cao 2017). The biological questions, the number of data sets to integrate, and the type of response variable, whether qualitative (classification), quantitative (regression), one (PLS1) or several (PLS) responses, all drive the choice of analytical method.
All methods featured in this diagram include variable selection except rCCA. In N-integration, rCCA and PLS enable the integration of two quantitative data sets, whilst the block PLS methods (which derive from the methods of Tenenhaus and Tenenhaus (2011)) can integrate more than two data sets. In P-integration, our method MINT is based on multi-group PLS (Eslami et al. 2014). The following activities cover some of these methods.

1 Introduction {#01}

mixOmics is an R toolkit dedicated to the exploration and integration of biological data sets, with a specific focus on variable selection. The package currently includes more than twenty multivariate methodologies, mostly developed by the mixOmics team (see some of our references in 1.2.3). Originally, all methods were designed for omics data; however, their application is not limited to biological data. Other applications where integration is required can also be considered, mostly in cases where the predictor variables are continuous (see also 1.1).

In mixOmics, a strong focus is given to graphical representation to better translate and understand the relationships between the different data types and visualize the correlation structure at both sample and variable levels.

1.1 Input data {#01:datatypes}

Note the data pre-processing requirements before analysing data with mixOmics:

  • Types of data. Different types of biological data can be explored and integrated with mixOmics. Our methods can handle molecular features measured on a continuous scale (e.g. microarray, mass spectrometry-based proteomics and metabolomics) or sequence-based count data (RNA-seq, 16S, shotgun metagenomics) that become 'continuous' data after pre-processing and normalisation.

  • Normalisation. The package does not handle normalisation, as it is platform-specific and the package covers too wide a variety of data. Prior to the analysis, we assume the data sets have been normalised using appropriate normalisation methods and pre-processed when applicable.

  • Prefiltering. While mixOmics methods can handle large data sets (several tens of thousands of predictors), we recommend pre-filtering the data to fewer than 10K predictor variables per data set, for example by using the Median Absolute Deviation (Teng et al. 2016) for RNA-seq data, by removing consistently low counts in microbiome data sets (Lê Cao et al. 2016), or by removing near-zero variance predictors (see the sketch after this list). This step aims to lessen the computational time during the parameter tuning process.

  • Data format. Our methods use matrix decomposition techniques. Therefore, the numeric data matrix or data frames have \(n\) observations or samples in rows and \(p\) predictors or variables (e.g. genes, proteins, OTUs) in columns.

  • Covariates. In the current version of mixOmics, covariates that may confound the analysis are not included in the methods. We recommend correcting for those covariates beforehand using appropriate univariate or multivariate methods for batch effect removal. Contact us for more details as we are currently working on this aspect.
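As a minimal sketch of the pre-filtering and data format points above, the code below checks the matrix orientation and removes near-zero variance predictors with the nearZeroVar() function from mixOmics; the liver.toxicity data set (analysed in Chapter 4) is used purely for illustration:

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene   # samples in rows, variables in columns
dim(X)                     # 64 samples x 3116 genes

# Remove near-zero variance predictors, if any, before tuning
nzv <- nearZeroVar(X)
if (length(nzv$Position) > 0) X <- X[, -nzv$Position]
dim(X)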

1.2 Methods

1.2.1 Some background knowledge {#01:methods}

We list here the main methodological or theoretical concepts you need to know to be able to efficiently apply mixOmics:

  • Individuals, observations or samples: the experimental units on which information is collected, e.g. patients, cell lines, cells, faecal samples etc.

  • Variables, predictors: read-out measured on each sample, e.g. gene (expression), protein or OTU (abundance), weight etc.

  • Variance: measures the spread of one variable. In our methods, we estimate the variance of components rather than that of individual variable read-outs. A high variance indicates that the data points are very spread out from the mean, and from one another (scattered).

  • Covariance: measures the strength of the relationship between two variables, i.e. whether they co-vary. A high covariance value indicates a strong relationship, e.g. weight and height in individuals often vary in the same way; roughly, the heaviest are the tallest. A covariance value has no lower or upper bound.

  • Correlation: a standardized version of the covariance that is bounded by -1 and 1.

  • Linear combination: variables are combined by multiplying each of them by a coefficient and adding the results. A linear combination of height and weight could be \(2 * weight - 1.5 * height\), with the coefficients \(2\) and \(-1.5\) assigned to weight and height respectively (a short sketch follows this list).

  • Component: an artificial variable built from a linear combination of the observed variables in a given data set. Variable coefficients are optimally defined based on some statistical criterion. For example in Principal Component Analysis, the coefficients of a (principal) component are defined so as to maximise the variance of the component.

  • Loadings: variable coefficients used to define a component.

  • Sample plot: representation of the samples projected into a small space spanned (defined) by the components. Sample coordinates are determined by their component values, or scores.

  • Correlation circle plot: representation of the variables in a space spanned by the components. Each variable coordinate is defined as the correlation between the original variable value and each component. A correlation circle plot enables us to visualise the correlation between variables - negative or positive correlation, defined by the cosine of the angle between the centre of the circle and each variable point - and the contribution of each variable to each component - defined by the absolute value of the coordinate on each component. For this interpretation, data need to be centred and scaled (the default in most of our methods, except PCA). For more details on this insightful graphic, see Figure 1 in González et al. (2012).

  • Unsupervised analysis: the method does not take into account any known sample groups and the analysis is exploratory. Examples of unsupervised methods covered in this vignette are Principal Component Analysis (PCA, Chapter 3), Projection to Latent Structures (PLS, Chapter 4), and also Canonical Correlation Analysis (CCA, not covered here but see the website page).

  • Supervised analysis: the method includes a vector indicating the class membership of each sample. The aim is to discriminate sample groups and perform sample class prediction. Examples of supervised methods covered in this vignette are PLS Discriminant Analysis (PLS-DA, Chapter 5), DIABLO (Chapter 6) and also MINT (Chapter 7).
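To make the notions of linear combination, component, and loadings concrete, here is a minimal sketch with entirely made-up weight and height values (the numbers are hypothetical):

# Hypothetical measurements for five individuals
weight <- c(60, 75, 82, 58, 90)
height <- c(165, 180, 178, 160, 185)

# The loadings (coefficients) 2 and -1.5 define the linear combination
component <- 2 * weight - 1.5 * height

component       # score of each individual on this artificial variable
var(component)  # variance of the component, the quantity PCA maximises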

If the above descriptions were not comprehensive enough and you have some more questions, feel free to explore the glossary on our website.

1.2.2 Overview {#01:overview}

Here is an overview of the most widely used methods in mixOmics that will be further detailed in this vignette, with the exception of rCCA. We depict them along with the type of data set they can handle.


Figure 1: An overview of the quantity and type of data set each method within mixOmics requires. Thin columns represent a single variable, while the larger blocks represent data sets of multiple samples and variables.


Figure 2: List of methods in mixOmics, sparse indicates methods that perform variable selection


Figure 3: Main functions and parameters of each method

1.2.3 Key publications {#01:pubs}

The methods implemented in mixOmics are described in detail in the following publications; a more extensive list can be found on our website.

1.3 Outline of this Vignette {#01:outline}

  • Chapter 2: details some practical aspects to get started
  • Chapter 3: Principal Components Analysis (PCA)
  • Chapter 4: Projection to Latent Structures (PLS)
  • Chapter 5: Projection to Latent Structure - Discriminant Analysis (PLS-DA)
  • Chapter 6: Integrative analysis for multiple data sets, across samples (namely DIABLO)
  • Chapter 7: Integrative analysis for multiple data, across features (namely MINT)

Each methods chapter has the following outline:

  1. Type of biological question to be answered
  2. Brief description of an illustrative data set
  3. Principle of the method
  4. Quick start of the method with the main functions and arguments
  5. To go further: customized plots, additional graphical outputs, and tuning parameters
  6. FAQ

1.4 Other methods not covered in this vignette

Other methods not covered in this document are described on our website and in the following references:

2 Let’s get started {#02}

2.1 Installation {#02:install}

First, install the latest version from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
 BiocManager::install("mixOmics")

Alternatively, you can install the latest GitHub version of the package:

BiocManager::install("mixOmicsTeam/mixOmics")

The mixOmics package should directly import the following packages: igraph, rgl, ellipse, corpcor, RColorBrewer, plyr, parallel, dplyr, tidyr, reshape2, methods, matrixStats, rARPACK, gridExtra. For macOS users: if you are unable to install the imported package rgl, you will need to install the XQuartz software first.

2.2 Load the package {#02:load-data}

library(mixOmics)

Check that there is no error when loading the package, especially for the rgl library (see above).

2.3 Upload data

The examples we give in this vignette use data that are already part of the package. To upload your own data, check first that your working directory is set, then read your data from a .txt or .csv format, either by using File > Import Dataset in RStudio or via one of these command lines:

# from csv file
data <- read.csv("your_data.csv", row.names = 1, header = TRUE)

# from txt file
data <- read.table("your_data.txt", header = TRUE)

For more details about the arguments of these functions, type ?read.csv or ?read.table in the R console.

2.4 Quick start in mixOmics {#02:quick-start}

Each analysis should follow this workflow:

  1. Run the method
  2. Graphical representation of the samples
  3. Graphical representation of the variables

Then use your critical thinking, along with additional functions and visual tools (some of which are listed in 1.2.2 and described in the next chapters), to make sense of your data!

For instance, for Principal Components Analysis, we first load the data:

data(nutrimouse)
X <- nutrimouse$gene

Then use the following steps:

MyResult.pca <- pca(X)  # 1 Run the method
plotIndiv(MyResult.pca) # 2 Plot the samples

plotVar(MyResult.pca)   # 3 Plot the variables

This is only a first quick start; there are many avenues you can take to deepen your exploratory and integrative analyses. The package proposes several methods to perform variable (feature) selection to identify the relevant information from rather large omics data sets. The sparse methods are listed in the Table in 1.2.2.

Following our example here, sparse PCA can be applied to select the top 5 variables contributing to each of the two components in PCA. The user specifies the number of variables to select on each component; for example, here 5 variables are selected on each of the first two components (keepX = c(5,5)):

MyResult.spca <- spca(X, keepX=c(5,5)) # 1 Run the method
plotIndiv(MyResult.spca)               # 2 Plot the samples

plotVar(MyResult.spca)                 # 3 Plot the variables

You can now see that we have considerably reduced the number of genes in the plotVar correlation circle plot.

Do not stop here! We are not done yet. You can enhance your analyses with the following:

  • Have a look at our manual and each of the functions and their examples, e.g. ?pca, ?plotIndiv, ?spca, …

  • Run the examples from the help file using the example function: example(pca), example(plotIndiv), …

  • Have a look at our website, which features many tutorials and case studies.

  • Keep reading this vignette, this is just the beginning!

3 PCA on the multidrug study {#03}

To illustrate PCA and its variants, we will analyse the multidrug case study available in the package. This pharmacogenomic study investigates the patterns of drug activity in cancer cell lines (Szakács et al. 2004). These cell lines come from the NCI-60 Human Tumor Cell Lines established by the Developmental Therapeutics Program of the National Cancer Institute (NCI) to screen for the toxicity of chemical compound repositories in diverse cancer cell lines. NCI-60 includes cell lines derived from cancers of colorectal (7 cell lines), renal (8), ovarian (6), breast (8), prostate (2), lung (9) and central nervous system origin (6), as well as leukemia (6) and melanoma (8).

Two separate data sets (representing two types of measurements) on the same NCI-60 cancer cell lines are available in multidrug (see also ?multidrug):

Additional information will also be used in the outputs:

In this activity, we illustrate PCA performed on the human ABC transporters ABC.trans, and sparse PCA on the compound data compound.

3.1 Load the data {#03:load-data}

The input data matrix \(\boldsymbol{X}\) is of size \(N\) samples in rows and \(P\) variables (e.g. genes) in columns. We start with the ABC.trans data.

library(mixOmics)
data(multidrug)
X <- multidrug$ABC.trans
dim(X) # Check dimensions of data
## [1] 60 48

3.2 Example: PCA {#03:pca}

3.2.1 Choose the number of components {#03:pca-ncomp}

Contrary to the minimal code example, here we choose to also scale the variables for the reasons detailed earlier. The function tune.pca() calculates the cumulative proportion of explained variance for a large number of principal components (here we set ncomp = 10). A screeplot of the proportion of explained variance relative to the total amount of variance in the data for each principal component is output (Fig. 4):

tune.pca.multi <- tune.pca(X, ncomp = 10, scale = TRUE)
plot(tune.pca.multi)

Figure 4: Screeplot from the PCA performed on the ABC.trans data: Amount of explained variance for each principal component on the ABC transporter data.

# tune.pca.multi$cum.var       # Outputs cumulative proportion of variance

From the numerical output (not shown here), we observe that the first two principal components explain 22.87% of the total variance, and the first three principal components explain 29.88% of the total variance. The rule of thumb for choosing the number of components is not so much to set a hard threshold based on the cumulative proportion of explained variance (as this is data-dependent), but to observe where a drop, or elbow, appears on the screeplot. The elbow indicates that the remaining variance is spread over many principal components and is not relevant in obtaining a low-dimensional 'snapshot' of the data. This is an empirical way of choosing the number of principal components to retain in the analysis. In this specific example we could choose two or three components for the final PCA; however, these criteria are highly subjective and the reader must keep in mind that visualisation becomes difficult above three dimensions.

3.2.2 PCA with fewer components {#03:pca-final}

Based on the preliminary analysis above, we run a PCA with three components. Here we show additional input, such as whether to center or scale the variables.

final.pca.multi <- pca(X, ncomp = 3, center = TRUE, scale = TRUE)
# final.pca.multi  # Lists possible outputs

The output is similar to the tuning step above. Here the total variance in the data is:

final.pca.multi$var.tot
## [1] 47.98305

Summing the variance explained across all possible components would recover this total amount of variance. The proportion of explained variance per component is:

final.pca.multi$prop_expl_var$X
##        PC1        PC2        PC3 
## 0.12677541 0.10194929 0.07011818

The cumulative proportion of variance explained can also be extracted (as displayed in Figure 4):

final.pca.multi$cum.var
##       PC1       PC2       PC3 
## 0.1267754 0.2287247 0.2988429
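As a quick sanity check, reusing only the outputs above, the cumulative proportion is simply the running sum of the per-component proportions:

cumsum(final.pca.multi$prop_expl_var$X)
##       PC1       PC2       PC3 
## 0.1267754 0.2287247 0.2988429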

3.2.3 Identify the informative variables {#03:pca-vars}

To calculate components, we use the variable coefficient weights indicated in the loading vectors. Therefore, the absolute value of the coefficients in the loading vectors inform us about the importance of each variable in contributing to the definition of each component. We can extract this information through the selectVar() function which ranks the most important variables in decreasing order according to their absolute loading weight value for each principal component.

# Top variables on the first component only:
head(selectVar(final.pca.multi, comp = 1)$value)
##        value.var
## ABCE1  0.3242162
## ABCD3  0.2647565
## ABCF3  0.2613074
## ABCA8 -0.2609394
## ABCB7  0.2493680
## ABCF1  0.2424253

Note:

  • Here the variables are not selected (all are included), but ranked according to their importance in defining each component.

3.2.4 Sample plots {#03:pca-sample-plot}

We project the samples into the space spanned by the principal components to visualise how the samples cluster and assess for biological or technical variation in the data. We colour the samples according to the cell line information available in multidrug$cell.line$Class by specifying the argument group (Fig. 5).

plotIndiv(final.pca.multi,
          comp = c(1, 2),   # Specify components to plot
          ind.names = TRUE, # Show row names of samples
          group = multidrug$cell.line$Class,
          title = 'ABC transporters, PCA comp 1 - 2',
          legend = TRUE, legend.title = 'Cell line')

Figure 5: Sample plot from the PCA performed on the ABC.trans data. Samples are projected into the space spanned by the first two principal components, and coloured according to cell line type. Numbers indicate the rownames of the data.

Because we have run PCA on three components, we can examine the third component, either by plotting the samples onto principal components 1 and 3 (PC1 and PC3) in the code above (comp = c(1, 3)) or by using the 3D interactive plot (code shown below). The addition of the third principal component only seems to highlight a potential outlier (sample 8, not shown). Potentially, this sample could be removed from the analysis, or noted for further downstream analysis. The removal of outliers should be exercised with great caution and backed up with several other types of analyses (e.g. clustering) or graphical outputs (e.g. boxplots, heatmaps, etc).

# Interactive 3D plot will load the rgl library.
plotIndiv(final.pca.multi, style = '3d',
          group = multidrug$cell.line$Class,
          title = 'ABC transporters, PCA comp 1 - 3')

These plots suggest that the largest source of variation explained by the first two components can be attributed to the melanoma cell line, while the third component highlights a single outlier sample. Hence, the interpretation of the following outputs should primarily focus on the first two components.

Note:

  • Had we not scaled the data, the separation of the melanoma cell lines would have been more obvious with the addition of the third component, while PC1 and PC2 would have also highlighted the sample outliers 4 and 8. Thus, centering and scaling are important steps to take into account in PCA.

3.2.5 Variable plot: correlation circle plot {#03:pca-variable-plot}

Correlation circle plots, obtained with the plotVar() function, indicate the contribution of each variable to each component, as well as the correlation between variables (indicated by a 'cluster' of variables). Note that to interpret the latter, the variables need to be centered and scaled in PCA:

plotVar(final.pca.multi, comp = c(1, 2),
        var.names = TRUE,
        cex = 3,         # To change the font size
        # cutoff = 0.5,  # For further cutoff
        title = 'Multidrug transporter, PCA comp 1 - 2')

Figure 6: Correlation Circle plot from the PCA performed on the ABC.trans data. The plot shows groups of transporters that are highly correlated, and also contribute to PC1 - near the big circle on the right hand side of the plot (transporters grouped with those in orange) - or to PC1 and PC2 - top left and bottom left corners of the plot, transporters grouped with those in pink and yellow.

The plot in Figure 6 highlights a group of ABC transporters that contribute to PC1: ABCE1, and to some extent the group clustered with ABCB8, contribute positively to PC1, while ABCA8 contributes negatively. We also observe a group of transporters that contribute to both PC1 and PC2: the group clustered with ABCC2 contributes positively to PC2 and negatively to PC1, and a cluster of ABCC12 and ABCD2 contributes negatively to both PC1 and PC2. We observe that several transporters lie inside the small circle. However, examining the third component (argument comp = c(1, 3)) does not appear to reveal further transporters that contribute to it. The additional argument cutoff = 0.5 could further simplify this plot, as sketched below.
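As a sketch, the simplified plot only requires adding the cutoff argument to the call used above; variables with correlations below 0.5 on the displayed components are then hidden:

plotVar(final.pca.multi, comp = c(1, 2),
        var.names = TRUE,
        cex = 3,
        cutoff = 0.5,   # hide variables with correlations below 0.5
        title = 'Multidrug transporter, PCA comp 1 - 2, cutoff 0.5')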

3.2.6 Biplot: samples and variables {#03:pca-biplot}

A biplot allows us to display both samples and variables simultaneously to further understand their relationships. Samples are displayed as dots while variables are displayed at the tips of the arrows. Similar to correlation circle plots, data must be centered and scaled to interpret the correlation between variables (as a cosine angle between variable arrows).

biplot(final.pca.multi, group = multidrug$cell.line$Class, 
       legend.title = 'Cell line')

Figure 7: Biplot from the PCA performed on the ABC.trans data. The plot highlights which transporter expression levels may be related to specific cell lines, such as melanoma.

The biplot in Figure 7 shows that the melanoma cell lines seem to be characterised by a subset of transporters such as the cluster around ABCC2 as highlighted previously in Figure 6. Further examination of the data, such as boxplots (as shown in Fig. 8), can further elucidate the transporter expression levels for these specific samples.

ABCC2.scale <- scale(X[, 'ABCC2'], center = TRUE, scale = TRUE)

boxplot(ABCC2.scale ~
        multidrug$cell.line$Class, col = color.mixo(1:9),
        xlab = 'Cell lines', ylab = 'Expression levels, scaled',
        par(cex.axis = 0.5), # Font size
        main = 'ABCC2 transporter')

Figure 8: Boxplots of the transporter ABCC2 identified from the PCA correlation circle plot (Fig. 6) and the biplot (Fig. 7) show the level of ABCC2 expression related to cell line types. The expression level of ABCC2 was centered and scaled in the PCA, but similar patterns are also observed in the original data.

3.3 Example: sparse PCA {#03:spca}

In the ABC.trans data, there is only one missing value. Missing values can be handled internally by sPCA via the NIPALS algorithm. However, if the number of missing values is large, we recommend imputing them with NIPALS first, as described on our website www.mixOmics.org.
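A minimal sketch of such an imputation, assuming a recent version of mixOmics that exports the impute.nipals() function:

sum(is.na(X))                            # one missing value in ABC.trans
X.impute <- impute.nipals(X, ncomp = 3)  # reconstruct X from 3 components
sum(is.na(X.impute))                     # should now return 0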

3.3.1 Choose the number of variables to select {#03:spca-vars}

First, we must decide on the number of components to evaluate. The previous tuning step indicated that ncomp = 3 was sufficient to explain most of the variation in the data, which is the value we choose in this example. We then set up a grid of keepX values to test, which can be thin or coarse depending on the total number of variables. Here we set up the grid to be thin at the start and coarse as the number of variables increases. The ABC.trans data include a sufficient number of samples to perform repeated 5-fold cross-validation, for which we define the number of folds and repeats (leave-one-out CV is also possible if the number of samples \(N\) is small, by specifying folds = \(N\)). The computation may take a while if you are not using parallelisation (see additional parameters in tune.spca()); here we use a small number of repeats for illustrative purposes. We then plot the output of the tuning function.

grid.keepX <- c(seq(5, 30, 5))
# grid.keepX  # To see the grid

set.seed(30) # For reproducibility with this handbook, remove otherwise
tune.spca.result <- tune.spca(X, ncomp = 3, 
                              folds = 5, 
                              test.keepX = grid.keepX, nrepeat = 10) 

# Consider adding up to 50 repeats for more stable results
tune.spca.result$choice.keepX
## comp1 comp2 comp3 
##    15    15    25
plot(tune.spca.result)

Figure 9: Tuning the number of variables to select with sPCA on the ABC.trans data. For a grid of number of variables to select indicated on the x-axis, the average correlation between predicted and actual components based on cross-validation is calculated and shown on the y-axis for each component. The optimal number of variables to select per component is assessed via one-sided \(t-\)tests and is indicated with a diamond.

The tuning function outputs the averaged correlation between predicted and actual components per keepX value for each component. It indicates the optimal number of variables to select for which the averaged correlation is maximised on each component. Figure 9 shows that this is achieved when selecting 15 transporters on the first component, and 15 on the second. Given the drop in values in the averaged correlations for the third component, we decide to only retain two components.

Note:

  • If the tuning results suggest a large number of variables to select that is close to the total number of variables, we can arbitrarily choose a much smaller selection size.

3.3.2 Final sparse PCA {#03:spca-final}

Based on the tuning above, we perform the final sPCA where the number of variables to select on each component is specified with the argument keepX. Arbitrary values can also be input if you would like to skip the tuning step for more exploratory analyses:

# By default center = TRUE, scale = TRUE
keepX.select <- tune.spca.result$choice.keepX[1:2]

final.spca.multi <- spca(X, ncomp = 2, keepX = keepX.select)

# Proportion of explained variance:
final.spca.multi$prop_expl_var$X
##       PC1       PC2 
## 0.1171694 0.1004163

Overall, when considering two components, we lose approximately 1.1% of explained variance compared to a full PCA (see the sketch below); however, the aim of this analysis is to identify the key transporters driving the variation in the data, as we show next.
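This figure can be checked directly from the objects computed earlier (a quick sketch):

# Cumulative explained variance over two components: full PCA vs sPCA
final.pca.multi$cum.var[2]             # full PCA:   ~0.229
sum(final.spca.multi$prop_expl_var$X)  # sparse PCA: ~0.218
# The difference is approximately 0.011, i.e. ~1.1% of the total variance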

3.3.3 Sample and variable plots {#03:spca-plots}

We first examine the sPCA sample plot:

plotIndiv(final.spca.multi,
          comp = c(1, 2),   # Specify components to plot
          ind.names = TRUE, # Show row names of samples
          group = multidrug$cell.line$Class,
          title = 'ABC transporters, sPCA comp 1 - 2',
          legend = TRUE, legend.title = 'Cell line')

Figure 10: Sample plot from the sPCA performed on the ABC.trans data. Samples are projected onto the space spanned by the first two sparse principal components that are calculated based on a subset of selected variables. Samples are coloured by cell line type and numbers indicate the sample IDs.

In Figure 10, component 2 in sPCA shows clearer separation of the melanoma samples compared to the full PCA. Component 1 is similar to the full PCA. Overall, this sample plot shows that little information is lost compared to a full PCA.

A biplot can also be plotted that only shows the selected transporters (Figure 11):

biplot(final.spca.multi, group = multidrug$cell.line$Class, 
       legend = FALSE)

Figure 11: Biplot from the sPCA performed on the ABC.trans data after variable selection. The plot highlights in more detail which transporter expression levels may be related to specific cell lines, such as melanoma, compared to a classical PCA.

The correlation circle plot highlights variables that contribute to component 1 and component 2 (Fig. 12):

plotVar(final.spca.multi, comp = c(1, 2), var.names = TRUE, 
        cex = 3, # To change the font size 
        title = 'Multidrug transporter, sPCA comp 1 - 2')

Figure 12: Correlation Circle plot from the sPCA performed on the ABC.trans data. Only the transporters selected by the sPCA are shown on this plot. Transporters coloured in green are discussed in the text.

The transporters selected by sPCA are amongst the most important ones in PCA. Those coloured in green in Figure 12 (ABCA9, ABCB5, ABCC2 and ABCD1) show an example of variables that contribute positively to component 2, but with a larger weight than in PCA. Thus, they appear as a clearer cluster in the top part of the correlation circle plot compared to PCA. As shown in the biplot in Figure 11, they contribute to explaining the variation in the melanoma samples.

We can extract the variable names and their positive or negative contribution to a given component (here 2), using the selectVar() function:

# On the second component, just a head
head(selectVar(final.spca.multi, comp = 2)$value)
##        value.var
## ABCA9  0.4510621
## ABCB5  0.4184759
## ABCC2  0.4046096
## ABCD1  0.3921438
## ABCA3 -0.2779984
## ABCD2 -0.2255945

The loading weights can also be visualised with plotLoadings(), where variables are ranked from the least important (top) to the most important (bottom) (Figure 13). Here on component 2:

plotLoadings(final.spca.multi, comp = 2)

Figure 13: sPCA loading plot of the ABC.trans data for component 2. Only the transporters selected by sPCA on component 2 are shown, and are ranked from least important (top) to most important. Bar length indicates the loading weight in PC2.

4 PLS on the liver toxicity study {#04}

The data come from a liver toxicity study in which 64 male rats were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg) or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) (Bushel, Wolfinger, and Gibson 2007). Necropsy was performed at 6, 18, 24 and 48 hours after exposure and the mRNA was extracted from the liver. Ten clinical measurements of markers for liver injury are available for each subject. The microarray data contain expression levels of 3,116 genes. The data were normalised and preprocessed by Bushel, Wolfinger, and Gibson (2007).

liver.toxicity contains the following:

We can analyse these two data sets (genes and clinical measurements) using sPLS1, then sPLS2 with a regression mode to explain or predict the clinical variables with respect to the gene expression levels.

4.1 Load the data {#04:load-data}

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

As we have discussed previously for integrative analysis, we need to ensure that the samples in the two data sets are in the same order, or matching, as we are performing data integration:

head(data.frame(rownames(X), rownames(Y)))
##   rownames.X. rownames.Y.
## 1       ID202       ID202
## 2       ID203       ID203
## 3       ID204       ID204
## 4       ID206       ID206
## 5       ID208       ID208
## 6       ID209       ID209
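A stricter check than eyeballing the first rows is to compare all row names at once (a one-line sketch):

all(rownames(X) == rownames(Y))  # TRUE if samples are matched across data sets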

4.2 Example: sPLS1 regression {#04:spls1}

We first start with a simple case scenario where we wish to explain one \(\boldsymbol Y\) variable with a combination of selected \(\boldsymbol X\) variables (transcripts). We choose the following clinical measurement which we denote as the \(\boldsymbol y\) single response variable:

y <- liver.toxicity$clinic[, "ALB.g.dL."]

4.2.1 Number of dimensions using the \(Q^2\) criterion {#04:spls1-ncomp}

Defining the 'best' number of dimensions to explain the data requires that we first launch a PLS1 model with a large number of components. Some of the outputs from the PLS1 object are then retrieved by the perf() function to calculate the \(Q^2\) criterion using repeated 10-fold cross-validation.

tune.pls1.liver <- pls(X = X, Y = y, ncomp = 4, mode = 'regression')
set.seed(33)  # For reproducibility with this handbook, remove otherwise
Q2.pls1.liver <- perf(tune.pls1.liver, validation = 'Mfold', 
                      folds = 10, nrepeat = 5)
plot(Q2.pls1.liver, criterion = 'Q2')

Figure 14: \(Q^2\) criterion to choose the number of components in PLS1. For each dimension added to the PLS model, the \(Q^2\) value is shown. The horizontal line of 0.0975 indicates the threshold below which adding a dimension may not be beneficial to improve accuracy in PLS.

The plot in Figure 14 shows that the \(Q^2\) value varies with the number of dimensions added to PLS1, with a decrease to negative values from 2 dimensions. Based on this plot we would choose only one dimension, but we will still add a second dimension for the graphical outputs.

Note:

  • One dimension is not unusual given that we only include one \(\boldsymbol y\) variable in PLS1.

4.2.2 Number of variables to select in \(\boldsymbol X\) {#04:spls1-tuning}

We now set a grid of values - thin at the start, but also restricted to a small number of genes for a parsimonious model, which we will test for each of the two components in the tune.spls() function, using the MAE criterion.

# Set up a grid of values: 
list.keepX <- c(5:10, seq(15, 50, 5))     

# list.keepX  # Inspect the keepX grid
set.seed(33)  # For reproducibility with this handbook, remove otherwise
tune.spls1.MAE <- tune.spls(X, y, ncomp = 2, 
                            test.keepX = list.keepX, 
                            validation = 'Mfold', 
                            folds = 10,
                            nrepeat = 5, 
                            progressBar = FALSE, 
                            measure = 'MAE')
plot(tune.spls1.MAE)

Figure 15: Mean Absolute Error criterion to choose the number of variables to select in PLS1, using repeated CV for a grid of variables to select. The MAE increases with the addition of a second dimension (comp 1 to 2), suggesting that only one dimension is sufficient. The optimal keepX is indicated with a diamond.

Figure 15 confirms that one dimension is sufficient to reach minimal MAE. Based on the tune.spls() function we extract the final parameters:

choice.ncomp <- tune.spls1.MAE$choice.ncomp$ncomp
# Optimal number of variables to select in X based on the MAE criterion
# We stop at choice.ncomp
choice.keepX <- tune.spls1.MAE$choice.keepX[1:choice.ncomp]  

choice.ncomp
## [1] 1
choice.keepX
## comp1 
##    20

Note:

  • Other criteria could have been used and may bring different results. For example, when using measure = 'MSE', the optimal keepX was rather unstable, and often smaller than when using the MAE criterion (see the sketch below). As we have highlighted before, there is some back and forth in the analyses to choose the criterion and parameters that best fit our biological question and interpretation.
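As a sketch, this comparison only requires swapping the measure argument in the tuning call used above (assuming your package version accepts measure = 'MSE', as implied by this note):

set.seed(33)  # For reproducibility, remove otherwise
tune.spls1.MSE <- tune.spls(X, y, ncomp = 2,
                            test.keepX = list.keepX,
                            validation = 'Mfold',
                            folds = 10, nrepeat = 5,
                            progressBar = FALSE,
                            measure = 'MSE')
tune.spls1.MSE$choice.keepX  # may differ from the MAE-based choice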

4.2.3 Final sPLS1 model {#04:spls1-final}

Here is our final model with the tuned parameters:

spls1.liver <- spls(X, y, ncomp = choice.ncomp, keepX = choice.keepX, 
                    mode = "regression")

The list of genes selected on component 1 can be extracted with the command line (not output here):

selectVar(spls1.liver, comp = 1)$X$name

We can compare the amount of explained variance for the \(\boldsymbol X\) data set based on the sPLS1 (on 1 component) versus PLS1 (that was run on 4 components during the tuning step):

spls1.liver$prop_expl_var$X
##      comp1 
## 0.08150917
tune.pls1.liver$prop_expl_var$X
##      comp1      comp2      comp3      comp4 
## 0.11079101 0.14010577 0.21714518 0.06433377

The amount of explained variance in \(\boldsymbol X\) is lower in sPLS1 than in PLS1 for the first component. However, we will see in this case study that the Mean Squared Error of Prediction (MSEP) is also lower (better) in sPLS1 compared to PLS1.

4.2.4 Sample plots {#04:spls1-sample-plots}

For further graphical outputs, we need to add a second dimension in the model, which can include the same number of keepX variables as in the first dimension. However, the interpretation should primarily focus on the first dimension. In Figure 16 we colour the samples according to the time of treatment and add symbols to represent the treatment dose. Recall however that such information was not included in the sPLS1 analysis.

spls1.liver.c2 <- spls(X, y, ncomp = 2, keepX = c(rep(choice.keepX, 2)), 
                   mode = "regression")

plotIndiv(spls1.liver.c2,
          group = liver.toxicity$treatment$Time.Group,
          pch = as.factor(liver.toxicity$treatment$Dose.Group),
          legend = TRUE, legend.title = 'Time', legend.title.pch = 'Dose')

Figure 16: Sample plot from the PLS1 performed on the liver.toxicity data with two dimensions. Components associated to each data set (or block) are shown. Focusing only on the projection of the sample on the first component shows that the genes selected in \(\boldsymbol X\) tend to explain the 48h length of treatment vs the earlier time points. This is somewhat in agreement with the levels of the \(\boldsymbol y\) variable. However, more insight can be obtained by plotting the first components only, as shown in Figure 17.

The alternative is to plot the component associated to the \(\boldsymbol X\) data set (here corresponding to a linear combination of the selected genes) vs. the component associated to the \(\boldsymbol y\) variable (corresponding to the scaled \(\boldsymbol y\) variable in PLS1 with one dimension), or calculate the correlation between both components:

# Define factors for colours matching plotIndiv above
time.liver <- factor(liver.toxicity$treatment$Time.Group, 
                     levels = c('18', '24', '48', '6'))
dose.liver <- factor(liver.toxicity$treatment$Dose.Group, 
                     levels = c('50', '150', '1500', '2000'))
# Set up colours and symbols
col.liver <- color.mixo(time.liver)
pch.liver <- as.numeric(dose.liver)

plot(spls1.liver$variates$X, spls1.liver$variates$Y,
     xlab = 'X component', ylab = 'y component / scaled y',
     col = col.liver, pch = pch.liver)
legend('topleft', col = color.mixo(1:4), legend = levels(time.liver),
       lty = 1, title = 'Time')
legend('bottomright', legend = levels(dose.liver), pch = 1:4,
       title = 'Dose')

Figure 17: Sample plot from the sPLS1 performed on the liver.toxicity data on one dimension. A reduced representation of the 20 genes selected and combined in the \(\boldsymbol X\) component on the \(x-\)axis with respect to the \(\boldsymbol y\) component value (equivalent to the scaled values of \(\boldsymbol y\)) on the \(y-\)axis. We observe a separation between the high doses 1500 and 2000 mg/kg (symbols \(+\) and \(\times\)) at 48h and 18h while low and medium doses cluster in the middle of the plot. High doses for 6h and 18h have high scores for both components.

cor(spls1.liver$variates$X, spls1.liver$variates$Y)
##           comp1
## comp1 0.7515489

Figure 17 is a reduced representation of a multivariate regression with PLS1. It shows that PLS1 effectively models a linear relationship between \(\boldsymbol y\) and the combination of the 20 genes selected in \(\boldsymbol X\).

4.2.5 Performance assessment of sPLS1 {#04:spls1-perf}

The performance of the final model can be assessed with the perf() function, using repeated cross-validation (CV). Because a single performance value has little meaning, we propose to compare the performances of a full PLS1 model (with no variable selection) with our sPLS1 model based on the MSEP (other criteria can be used):

set.seed(33)  # For reproducibility with this handbook, remove otherwise

# PLS1 model and performance
pls1.liver <- pls(X, y, ncomp = choice.ncomp, mode = "regression")
perf.pls1.liver <- perf(pls1.liver, validation = "Mfold", folds = 10, 
                   nrepeat = 5, progressBar = FALSE)
perf.pls1.liver$measures$MSEP$summary
##   feature comp      mean         sd
## 1       Y    1 0.7281681 0.04134627
# To extract values across all repeats:
# perf.pls1.liver$measures$MSEP$values

# sPLS1 performance
perf.spls1.liver <- perf(spls1.liver, validation = "Mfold", folds = 10, 
                   nrepeat = 5, progressBar = FALSE)
perf.spls1.liver$measures$MSEP$summary
##   feature comp      mean         sd
## 1       Y    1 0.5958565 0.02697727

The MSEP is lower with sPLS1 compared to PLS1, indicating that the \(\boldsymbol{X}\) variables selected (listed above with selectVar()) can be considered as a good linear combination of predictors to explain \(\boldsymbol y\).

4.3 Example: PLS2 regression {#04:spls2}

PLS2 is a more complex problem than PLS1, as we are attempting to fit a linear combination of a subset of \(\boldsymbol{Y}\) variables that are maximally covariant with a combination of \(\boldsymbol{X}\) variables. The sparse variant allows for the selection of variables from both data sets.

As a reminder, here are the dimensions of the \(\boldsymbol{Y}\) matrix that includes clinical parameters associated with liver failure.

dim(Y)
## [1] 64 10

4.3.1 Number of dimensions using the \(Q^2\) criterion {#04:spls2-ncomp}

Similar to PLS1, we first start by tuning the number of components to select by using the perf() function and the \(Q^2\) criterion using repeated cross-validation.

tune.pls2.liver <- pls(X = X, Y = Y, ncomp = 5, mode = 'regression')

set.seed(33)  # For reproducibility with this handbook, remove otherwise
Q2.pls2.liver <- perf(tune.pls2.liver, validation = 'Mfold', folds = 10, 
                      nrepeat = 5)
plot(Q2.pls2.liver, criterion = 'Q2.total')

Figure 18: \(Q^2\) criterion to choose the number of components in PLS2. For each component added to the PLS2 model, the averaged \(Q^2\) across repeated cross-validation is shown, with the horizontal line of 0.0975 indicating the threshold below which the addition of a dimension may not be beneficial to improve accuracy.

Figure 18 shows that one dimension should be sufficient in PLS2. We will include a second dimension in the graphical outputs, whilst focusing our interpretation on the first dimension.

Note:

  • Here we chose repeated cross-validation, however, the conclusions were similar for nrepeat = 1.

4.3.2 Number of variables to select in both \(\boldsymbol X\) and \(\boldsymbol Y\) {#04:spls2-tuning}

Using the tune.spls() function, we can perform repeated cross-validation to obtain some indication of the number of variables to select. We show an example of the code below, which may take some time to run (see ?tune.spls to use parallel computing). We refined the grid of tested values as the tuning function tended to favour a very small signature; hence we constrained the start of the grid to 3 for a more insightful signature. Both measure = 'cor' and measure = 'RSS' gave similar signature sizes, but this observation might differ for other case studies.

The optimal parameters can be output, along with a plot showing the tuning results, as shown in Figure 19:

# This code may take several minutes to run; a parallelisation option is available
list.keepX <- c(seq(5, 50, 5))
list.keepY <- c(3:10)

# Added to avoid errors where num_workers exceeds the limits set by devtools::check()
chk <- Sys.getenv("_R_CHECK_LIMIT_CORES_", "")
if (nzchar(chk) && chk == "TRUE") {
  # Use 2 cores in CRAN/Travis/AppVeyor
  num_workers <- 2L
} else {
  # Use all cores in devtools::test()
  num_workers <- parallel::detectCores()
}

set.seed(33)  # For reproducibility with this handbook, remove otherwise
tune.spls.liver <- tune.spls(X, Y, test.keepX = list.keepX, 
                             test.keepY = list.keepY, ncomp = 2, 
                             nrepeat = 1, folds = 10, mode = 'regression', 
                             measure = 'cor', 
                            # the following uses num_workers CPUs for faster
                            # computation; it can be commented out
                            BPPARAM = BiocParallel::SnowParam(workers = num_workers)
                            )

plot(tune.spls.liver)

Figure 19: Tuning plot for sPLS2. For every grid value of keepX and keepY, the averaged correlation coefficients between the \(\boldsymbol t\) and \(\boldsymbol u\) components are shown across repeated CV, with optimal values (here corresponding to the highest mean correlation) indicated in a green square for each dimension and data set.

4.3.3 Final sPLS2 model {#04:spls2-final}

Here is our final model with the tuned parameters for our sPLS2 regression analysis. Note that if you choose to not run the tuning step, you can still decide to set the parameters of your choice here.

# Optimal parameters
choice.keepX <- tune.spls.liver$choice.keepX
choice.keepY <- tune.spls.liver$choice.keepY
choice.ncomp <- length(choice.keepX)

spls2.liver <- spls(X, Y, ncomp = choice.ncomp, 
                   keepX = choice.keepX,
                   keepY = choice.keepY,
                   mode = "regression")

4.3.3.1 Numerical outputs {#04:spls2-variance}

The amount of explained variance can be extracted for each dimension and each data set:

spls2.liver$prop_expl_var
## $X
##      comp1      comp2 
## 0.19377955 0.08534548 
## 
## $Y
##     comp1     comp2 
## 0.3649944 0.2179876

4.3.3.2 Important variables {#04:spls2-variables}

The selected variables can be extracted from the selectVar() function, for example for the \(\boldsymbol X\) data set, with either their $name or the loading $value (not output here):

selectVar(spls2.liver, comp = 1)$X$value

The VIP measure is exported for all variables in \(\boldsymbol X\); here we only subset those that were selected (non-null loading value) on component 1:

vip.spls2.liver <- vip(spls2.liver)
# just a head
head(vip.spls2.liver[selectVar(spls2.liver, comp = 1)$X$name,1])
## A_42_P620915  A_43_P14131 A_42_P578246  A_43_P11724 A_42_P840776 A_42_P675890 
##     24.20292     22.10492     15.72275     14.98876     13.92233     13.11236

The (full) output shows that most \(\boldsymbol X\) variables that were selected are important for explaining \(\boldsymbol Y\), since their VIP is greater than 1.
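A quick sketch to quantify this claim, reusing the objects computed above:

# VIP of the variables selected on component 1
vip.comp1 <- vip.spls2.liver[selectVar(spls2.liver, comp = 1)$X$name, 1]
sum(vip.comp1 > 1)  # number of selected variables with VIP > 1
length(vip.comp1)   # total number of selected variables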

We can examine how frequently each variable is selected when we subsample the data using the perf() function to measure how stable the signature is (Table 1). The same could be output for other components and the \(\boldsymbol Y\) data set.

perf.spls2.liver <- perf(spls2.liver, validation = 'Mfold', folds = 10, nrepeat = 5)
# Extract stability
stab.spls2.liver.comp1 <- perf.spls2.liver$features$stability.X$comp1
# Averaged stability of the X selected features across CV runs, as shown in Table 1
stab.spls2.liver.comp1[1:choice.keepX[1]]

# We extract the stability measures of only the variables selected in spls2.liver
extr.stab.spls2.liver.comp1 <- stab.spls2.liver.comp1[selectVar(spls2.liver, 
                                                                  comp =1)$X$name]

Table 1: Stability measure (occurrence of selection) of the 20 variables from X selected with sPLS2 across repeated 10-fold subsampling on component 1.
Stability: 0.64, 0.78, 0.56, 0.44, 0.50; the remaining 15 entries are NA.

We recommend focusing mainly on the interpretation of the most stable selected variables (with a frequency of occurrence greater than 0.8); a sketch of how to extract them follows.
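For instance, the most stable variables can be extracted from the object computed above, using the 0.8 rule of thumb:

# Keep only the variables selected in more than 80% of the CV runs
stable.vars <- extr.stab.spls2.liver.comp1[!is.na(extr.stab.spls2.liver.comp1) &
                                           extr.stab.spls2.liver.comp1 > 0.8]
names(stable.vars)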

4.3.3.3 Graphical outputs {#04:spls2-plots}

Using the plotIndiv() function, we display the sample and metadata information using the arguments group (colour) and pch (symbol) to better understand the similarities between samples modelled with sPLS2.

The plot on the left hand side corresponds to the projection of the samples from the \(\boldsymbol X\) data set (gene expression) and the plot on the right hand side the \(\boldsymbol Y\) data set (clinical variables).

plotIndiv(spls2.liver, ind.names = FALSE, 
          group = liver.toxicity$treatment$Time.Group, 
          pch = as.factor(liver.toxicity$treatment$Dose.Group), 
          col.per.group = color.mixo(1:4),
          legend = TRUE, legend.title = 'Time', 
          legend.title.pch = 'Dose')