tune.block.splsda {mixOmics} | R Documentation |
Computes M-fold or Leave-One-Out Cross-Validation scores based on a
user-input grid to determine the optimal sparsity parameter values for
method block.splsda.
tune.block.splsda(
  X, Y, indY, ncomp = 2, test.keepX, already.tested.X,
  validation = "Mfold", folds = 10, dist = "max.dist", measure = "BER",
  weighted = TRUE, progressBar = FALSE, tol = 1e-06, max.iter = 100,
  near.zero.var = FALSE, nrepeat = 1, design, scheme = "horst",
  scale = TRUE, init = "svd", light.output = TRUE,
  signif.threshold = 0.01, cpus = 1
)
X |
A named list of numeric matrices of predictors (one per block), measured on the same samples. |
Y |
Either a factor or a class vector for the discrete outcome, or a numeric vector or matrix of continuous responses (for multi-response models). |
indY |
To supply if |
ncomp |
the number of components to include in the model. |
test.keepX |
A named list with the same length and names as X
(without the outcome Y, if it is provided in X and designated using
|
already.tested.X |
Optional, if |
validation |
character. What kind of (internal) validation to use,
matching one of |
folds |
the folds in the Mfold cross-validation. See Details. |
dist |
distance metric to estimate the
classification error rate, should be a subset of |
measure |
Two misclassification measures are available: overall
misclassification error |
weighted |
tune using either the performance of the Majority vote or the Weighted vote. |
progressBar |
by default set to |
tol |
Numeric, convergence tolerance criteria. |
max.iter |
Integer, the maximum number of iterations. |
near.zero.var |
boolean, see the internal |
nrepeat |
Number of times the Cross-Validation process is repeated. |
design |
numeric matrix of size (number of blocks in X) x (number of
blocks in X) with values between 0 and 1. Each value indicates the strength
of the relationship to be modelled between two blocks; a value of 0
indicates no relationship, 1 is the maximum value. If |
scheme |
Either "horst", "factorial" or "centroid". Default =
|
scale |
a logical value indicating whether the variables should be
scaled to have unit variance before the analysis takes place. The default is
|
init |
Mode of initialization used in the algorithm, either by Singular
Value Decomposition of the product of each block of X with Y ('svd') or each
block independently ('svd.single'). Default = |
light.output |
if set to FALSE, the prediction/classification of each
sample for each of |
signif.threshold |
numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Defaults to 0.01. |
cpus |
Integer, number of cpus to use. If greater than 1, the code will
run in parallel when repeating the cross-validation, which is usually the
most computationally intensive process. If spare CPUs are available, the
cross-validation is also parallelised on *nix-based OS which support
|
This tuning function should be used to tune the keepX parameters in the
block.splsda function (N-integration with sparse Discriminant Analysis).
M-fold or LOO cross-validation is performed with stratified subsampling where all classes are represented in each fold.
If validation = "Mfold", M-fold cross-validation is performed. The
number of folds to generate is to be specified in the argument folds.
If validation = "loo", leave-one-out cross-validation is performed.
By default folds is set to the number of unique individuals.
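The stratified fold assignment described above can be sketched in base R. This is only an illustration of the idea, not the code mixOmics runs internally: stratified_folds is a hypothetical helper, and the class labels and counts are toy values.

```r
# Hypothetical helper (not part of mixOmics): assign samples to M folds
# so that the samples of each class are spread across all folds.
stratified_folds <- function(y, folds = 10) {
  y <- as.factor(y)
  fold_id <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    # shuffle the samples of this class, then deal them out fold by fold
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(folds), length(idx))
  }
  fold_id
}

set.seed(42)
# toy outcome with unbalanced classes (assumed values, for illustration only)
y <- factor(rep(c("classA", "classB", "classC"), times = c(12, 8, 20)))
fold_id <- stratified_folds(y, folds = 5)

# rows are classes, columns are folds; no cell is empty
table(y, fold_id)
```

With leave-one-out, each sample forms its own fold, which corresponds to setting the number of folds to length(y) in this sketch.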
All combinations of test.keepX values are tested. A message reports how many models will be fitted on each component for a given test.keepX.
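To get a feel for how quickly the grid grows, the number of combinations can be counted in base R before launching the tuning. The sketch below reuses the test.keepX values from the Examples; how mixOmics builds the grid internally is not shown here, and expand.grid is used purely for illustration.

```r
# keepX values to test per block (same values as in the Examples)
test.keepX <- list(mrna = seq(10, 40, 20),
                   mirna = seq(10, 30, 10),
                   protein = seq(1, 10, 5))

# one row per combination of keepX values across the blocks
grid <- expand.grid(test.keepX)
head(grid)

# number of combinations tested on each component
n_combinations <- nrow(grid)

# same count, without building the grid
prod(lengths(test.keepX))
```

Here the three blocks contribute 2, 3 and 2 candidate values, so 12 combinations are evaluated per component. Adding values to any block multiplies the total, which is worth keeping in mind when nrepeat is greater than 1.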
More details about the prediction distances can be found in ?predict and the
supplemental material of the mixOmics article (Rohart et al. 2017). Details
about the PLS modes are in ?pls.
BER is appropriate in case of an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. BER is less biased towards majority classes during the performance assessment.
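The difference between the two measures can be seen on a toy example. The sketch below assumes the usual definition of BER as the mean of the per-class error rates; overall_error and balanced_error are hypothetical helpers written for this illustration and are not the functions used inside mixOmics.

```r
# Hypothetical helpers (not the mixOmics internals)
overall_error <- function(truth, pred) {
  mean(as.character(pred) != as.character(truth))
}

balanced_error <- function(truth, pred) {
  truth <- as.factor(truth)
  wrong <- as.character(pred) != as.character(truth)
  # error rate within each true class, then the mean across classes
  per_class <- tapply(wrong, truth, mean)
  mean(per_class)
}

# toy data: 90 samples of class A, 10 of class B
truth <- factor(rep(c("A", "B"), times = c(90, 10)))
# a classifier that always predicts the majority class
pred <- factor(rep("A", 100), levels = levels(truth))

overall_error(truth, pred)   # 0.1: looks acceptable
balanced_error(truth, pred)  # 0.5: class B is always misclassified
```

The overall error hides the fact that the minority class is never predicted correctly, whereas the balanced measure gives both classes equal weight.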
A list that contains:
error.rate |
returns the prediction error
for each |
choice.keepX |
returns the number of variables selected (optimal keepX) on each component, for each block. |
choice.ncomp |
returns the optimal number of components for the model
fitted with |
error.rate.class |
returns the
error rate for each level of |
predict |
Prediction values for each sample, each |
class |
Predicted class for each sample, each |
cor.value |
compute the correlation between latent variables for two-factor sPLS-DA analysis. |
Florian Rohart, Amrit Singh, Kim-Anh Lê Cao, AL J Abadi
Method:
Singh A., Gautier B., Shannon C., Vacher M., Rohart F., Tebbutt S. and Lê Cao K.A. (2016). DIABLO: multi omics integration for biomarker discovery.
mixOmics article:
Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752
block.splsda and http://www.mixOmics.org for more details.
library(mixOmics)
data("breast.TCGA")

# this is the X data as a list of mRNA, miRNA and protein blocks;
# the outcome Y is the subtype factor passed to tune.block.splsda below
data = list(mrna = breast.TCGA$data.train$mrna,
            mirna = breast.TCGA$data.train$mirna,
            protein = breast.TCGA$data.train$protein)

# set up a full design where every block is connected
# could also consider other weights, see our mixOmics manuscript
design = matrix(1, ncol = length(data), nrow = length(data),
                dimnames = list(names(data), names(data)))
diag(design) = 0
design

# set number of component per data set
ncomp = 5

# Tuning the first two components
# -------------
## Not run:
# definition of the keepX value to be tested for each block mRNA miRNA and protein
# names of test.keepX must match the names of 'data'
test.keepX = list(mrna = seq(10,40,20), mirna = seq(10,30,10), protein = seq(1,10,5))

# the following may take some time to run, note that for thorough tuning
# nrepeat should be > 1
tune = tune.block.splsda(X = data, Y = breast.TCGA$data.train$subtype,
                         ncomp = ncomp, test.keepX = test.keepX,
                         design = design, nrepeat = 3)
tune$choice.ncomp
tune$choice.keepX

# Only tuning the second component
# -------------
already.mrna = 4   # 4 variables selected on comp1 for mrna
already.mirna = 2  # 2 variables selected on comp1 for mirna
already.prot = 1   # 1 variable selected on comp1 for protein
already.tested.X = list(mrna = already.mrna, mirna = already.mirna,
                        protein = already.prot)

tune = tune.block.splsda(X = data, Y = breast.TCGA$data.train$subtype,
                         ncomp = 2, test.keepX = test.keepX,
                         design = design, already.tested.X = already.tested.X)
tune$choice.keepX
## End(Not run)