tune.block.splsda {mixOmics}                R Documentation

Tuning function for block.splsda method (N-integration with sparse Discriminant Analysis)

Description

Computes M-fold or Leave-One-Out Cross-Validation scores based on a user-input grid to determine the optimal sparsity parameter values for the block.splsda method.

Usage

tune.block.splsda(
  X,
  Y,
  indY,
  ncomp = 2,
  test.keepX,
  already.tested.X,
  validation = "Mfold",
  folds = 10,
  dist = "max.dist",
  measure = "BER",
  weighted = TRUE,
  progressBar = FALSE,
  tol = 1e-06,
  max.iter = 100,
  near.zero.var = FALSE,
  nrepeat = 1,
  design,
  scheme = "horst",
  scale = TRUE,
  init = "svd",
  light.output = TRUE,
  signif.threshold = 0.01,
  cpus = 1
)

Arguments

X

A list of numeric matrices (blocks) of predictors, measured on the same samples. NAs are allowed.

Y

A factor or a class vector for the discrete outcome.

indY

To be supplied if Y is missing; indicates the position of the response matrix in the list X.

ncomp

the number of components to include in the model.

test.keepX

A named list with the same length and names as X (without the outcome Y, if it is provided in X and designated using indY). Each entry of this list is a numeric vector for the different keepX values to test for that specific block.

already.tested.X

Optional, used if ncomp > 1. A named list of numeric vectors, each of length n_tested, indicating the number of variables to select from the X data sets on the first n_tested components.

validation

character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".

folds

The number of folds in the M-fold cross-validation. See Details.

dist

distance metric to estimate the classification error rate, should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details).

measure

Two misclassification measures are available: the overall misclassification error ("overall") or the Balanced Error Rate ("BER").

weighted

Logical. Tune using the performance of either the Weighted vote (TRUE, the default) or the Majority vote (FALSE).

progressBar

Logical. If set to TRUE, a progress bar of the computation is displayed. Default is FALSE.

tol

Numeric, convergence tolerance criteria.

max.iter

Integer, the maximum number of iterations.

near.zero.var

Logical, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default is FALSE.

nrepeat

Number of times the Cross-Validation process is repeated.

design

numeric matrix of size (number of blocks in X) x (number of blocks in X) with values between 0 and 1. Each value indicates the strength of the relationship to be modelled between two blocks; a value of 0 indicates no relationship, 1 is the maximum value. If Y is provided instead of indY, the design matrix is changed to include relationships to Y.

scheme

Either "horst", "factorial" or "centroid". Default is "horst"; see References.

scale

A logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. Default is TRUE. Alternatively, a vector of length equal to the number of columns of X can be supplied. The value is passed to scale.

init

Mode of initialization used in the algorithm, either by Singular Value Decomposition of the product of each block of X with Y ('svd') or of each block independently ('svd.single'). Default is 'svd'.

light.output

if set to FALSE, the prediction/classification of each sample for each of test.keepX and each comp is returned.

signif.threshold

numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Defaults to 0.01.

cpus

Integer, number of CPUs to use. If greater than 1, the code will run in parallel when repeating the cross-validation, which is usually the most computationally intensive process. If excess CPUs are available, the cross-validation is also parallelised on *nix-based operating systems that support mclapply.
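As an illustration of the design argument described above, the following sketch builds a design matrix in base R in which all blocks are weakly connected. The block names are those used in the Examples section; the weight of 0.1 is an arbitrary value chosen for illustration, not a recommendation from this page.

```r
# Block names are taken from the Examples section of this page.
block_names <- c("mrna", "mirna", "protein")

# Assumption: 0.1 is an arbitrary illustrative weight, not a recommended value.
weight <- 0.1

# Square matrix with one row and one column per block, values in [0, 1].
design <- matrix(weight,
                 nrow = length(block_names),
                 ncol = length(block_names),
                 dimnames = list(block_names, block_names))

# No relationship is modelled between a block and itself.
diag(design) <- 0

design
```

A value closer to 1 asks the model to put more emphasis on the relationship between the two blocks concerned; a value of 0 removes that relationship.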

Details

This tuning function should be used to tune the keepX parameters in the block.splsda function (N-integration with sparse Discriminant Analysis).

M-fold or LOO cross-validation is performed with stratified subsampling where all classes are represented in each fold.

If validation = "Mfold", M-fold cross-validation is performed. The number of folds to generate is to be specified in the argument folds.

If validation = "loo", leave-one-out cross-validation is performed. By default folds is set to the number of unique individuals.
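The idea of stratified folds can be sketched in base R as follows. This is a simplified illustration, not the internal mixOmics implementation; the class sizes, the number of folds M and the variable names are all assumptions made for the example.

```r
set.seed(42)

# Hypothetical outcome with three unbalanced classes.
y <- factor(rep(c("A", "B", "C"), times = c(12, 9, 6)))
M <- 3  # number of folds (the 'folds' argument when validation = "Mfold")

# Assign fold labels class by class so that every class is spread
# as evenly as possible across the M folds.
fold_id <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  fold_id[idx] <- sample(rep_len(seq_len(M), length(idx)))
}

# Rows are folds, columns are classes: every cell should be non-zero.
fold_table <- table(fold_id, y)
fold_table
```

With leave-one-out validation the number of folds equals the number of samples, so each fold holds a single sample and stratification does not apply in the same way.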

All combinations of test.keepX values are tested. A message reports how many models will be fitted on each component for a given test.keepX.
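To get a feel for the size of the grid before launching the tuning, the combinations can be enumerated in base R. This sketch uses the test.keepX list from the Examples section; the names grid and n_combinations are introduced here for illustration only.

```r
# Grid taken from the Examples section of this page.
test.keepX <- list(
  mrna    = seq(10, 40, 20),  # 10, 30
  mirna   = seq(10, 30, 10),  # 10, 20, 30
  protein = seq(1, 10, 5)     # 1, 6
)

# One row per combination of keepX values across the three blocks.
grid <- expand.grid(test.keepX)
n_combinations <- nrow(grid)

n_combinations  # 2 * 3 * 2 = 12
head(grid, 3)
```

Because the number of combinations is the product of the lengths of the individual vectors, adding values to any block multiplies the number of models fitted per component, fold and repeat.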

More details about the prediction distances can be found in ?predict and in the supplemental material of the mixOmics article (Rohart et al. 2017). Details about the PLS modes are in ?pls.

BER is appropriate in case of an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. BER is less biased towards majority classes during the performance assessment.
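The difference between the two measures can be seen on a small made-up example. The sketch below assumes one common definition of BER, the mean of the per-class error rates; the data are invented for illustration and the code does not call mixOmics.

```r
# Hypothetical unbalanced outcome: 90 samples in class A, 10 in class B.
truth <- factor(rep(c("A", "B"), times = c(90, 10)))

# Hypothetical predictions: all of A correct, 8 of the 10 B samples called A.
pred <- truth
idx_B <- which(truth == "B")
pred[idx_B[1:8]] <- "A"

# Overall error: proportion of misclassified samples, dominated by class A.
overall_error <- mean(pred != truth)

# Per-class error rates, then their mean (BER under the assumed definition).
class_error <- tapply(pred != truth, truth, mean)
ber <- mean(class_error)

overall_error  # 0.08
class_error    # A = 0, B = 0.8
ber            # 0.4
```

Here the overall error looks small because the majority class is predicted perfectly, whereas BER exposes the poor performance on the minority class.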

Value

A list that contains:

error.rate

returns the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.

choice.keepX

returns the number of variables selected (optimal keepX) on each component, for each block.

choice.ncomp

returns the optimal number of components for the model fitted with $choice.keepX.

error.rate.class

returns the error rate for each level of Y and for each component computed with the optimal keepX

predict

Prediction values for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

class

Predicted class for each sample, each test.keepX, each comp and each repeat. Only if light.output=FALSE

cor.value

compute the correlation between latent variables for two-factor sPLS-DA analysis.

Author(s)

Florian Rohart, Amrit Singh, Kim-Anh Lê Cao, AL J Abadi

References

Method:

Singh A., Gautier B., Shannon C., Vacher M., Rohart F., Tebbutt S. and Lê Cao K.A. (2016). DIABLO: multi omics integration for biomarker discovery.

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752.

See Also

block.splsda and http://www.mixOmics.org for more details.

Examples

data("breast.TCGA")
# X is a list of three blocks (mRNA, miRNA and protein);
# Y is the subtype factor, passed separately to the tuning function
data = list(mrna = breast.TCGA$data.train$mrna,
            mirna = breast.TCGA$data.train$mirna,
            protein = breast.TCGA$data.train$protein)
# set up a full design where every block is connected
# could also consider other weights, see our mixOmics manuscript
design = matrix(1, ncol = length(data), nrow = length(data),
                dimnames = list(names(data), names(data)))
diag(design) = 0
design
# set the number of components
ncomp = 5

# Tuning all components (ncomp = 5 here)
# -------------
## Not run: 
# definition of the keepX values to be tested for each block: mRNA, miRNA and protein
# names of test.keepX must match the names of 'data'
test.keepX = list(mrna = seq(10,40,20), mirna = seq(10,30,10), protein = seq(1,10,5))

# the following may take some time to run; note that for thorough tuning
# nrepeat should be > 1
tune = tune.block.splsda(X = data, Y = breast.TCGA$data.train$subtype,
                         ncomp = ncomp, test.keepX = test.keepX,
                         design = design, nrepeat = 3)

tune$choice.ncomp
tune$choice.keepX

# Only tuning the second component
# -------------

already.mrna = 4   # 4 variables selected on comp1 for mrna
already.mirna = 2  # 2 variables selected on comp1 for mirna
already.prot = 1   # 1 variable selected on comp1 for protein

already.tested.X = list(mrna = already.mrna, mirna = already.mirna, protein = already.prot)

tune = tune.block.splsda(X = data, Y = breast.TCGA$data.train$subtype,
                         ncomp = 2, test.keepX = test.keepX, design = design,
                         already.tested.X = already.tested.X)

tune$choice.keepX

## End(Not run)


[Package mixOmics version 6.12.2 Index]