ClassifyR is a framework for cross-validated classification. The rules that functions must follow to be used with it are shown in the table below, and a fully worked example shows how to incorporate an existing classifier into the framework. Functions may have any number of other arguments after the set of mandatory arguments.
The package class contains an implementation of the k Nearest Neighbours algorithm. Its function has the form knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE). It accepts a matrix or a data.frame as input, but ClassifyR calls transformation, feature selection and classifier functions with a DataFrame, a core Bioconductor data container from S4Vectors. ClassifyR also expects the training data to be the first parameter, the classes of the training data to be the second parameter and the test data to be the third, whereas knn takes the test data second and the classes third. Therefore, a wrapper for DataFrame that reorders the parameters is created.
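setGeneric("kNNinterface", function(measurementsTrain, ...) standardGeneric("kNNinterface"))
setMethod("kNNinterface", "DataFrame", function(measurementsTrain, classesTrain, measurementsTest, ..., verbose = 3)
{
  # Separate the classes from the measurements if they are stored in the same table.
  # The element name "outcomes" is assumed to match the one returned by .MAEtoWideTable.
  splitDataset <- .splitDataAndOutcomes(measurementsTrain, classesTrain)
  classesTrain <- splitDataset[["outcomes"]]
  measurementsTrain <- splitDataset[["measurements"]]
  # Keep only the numeric columns of both tables.
  isNumeric <- sapply(measurementsTrain, is.numeric)
  measurementsTrain <- measurementsTrain[, isNumeric, drop = FALSE]
  measurementsTest <- measurementsTest[, isNumeric, drop = FALSE]
  if(!requireNamespace("class", quietly = TRUE))
    stop("The package 'class' could not be found. Please install it.")
  if(verbose == 3)
    message("Fitting k Nearest Neighbours classifier to data and predicting classes.")
  # knn expects its arguments in the order train, test, classes.
  class::knn(as.matrix(measurementsTrain), as.matrix(measurementsTest), classesTrain, ...)
})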
The function only emits a progress message if verbose is 3. The verbosity levels are explained in the introductory vignette. .splitDataAndOutcomes is an internal function in ClassifyR which ensures that outcomes are not in measurements when model training happens. If classesTrain is a factor vector, then the function has no effect. If classesTrain is the character name of a column in measurementsTrain, that column is removed from the table and returned as a separate variable. The ... parameter captures any options to be passed on to knn, such as k (number of neighbours considered) and l (minimum vote for a definite decision). The function is also defensive and removes any non-numeric columns from the input table.
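The two preparation steps described above (separating the classes from the measurements and dropping non-numeric columns) can be sketched in base R. This is only an illustration under stated assumptions: splitClasses is a hypothetical helper written for an ordinary data.frame, it is not the code of .splitDataAndOutcomes, and the patients table is made-up data.

```r
# Hypothetical helper for a base R data.frame. It mimics the behaviour described
# above but is not ClassifyR's internal .splitDataAndOutcomes.
splitClasses <- function(measurements, classes)
{
  if(is.character(classes) && length(classes) == 1)
  {
    # classes names a column: extract it and remove it from the table.
    classColumn <- match(classes, colnames(measurements))
    if(is.na(classColumn)) stop("No column named '", classes, "' in the table.")
    classesVector <- factor(measurements[[classColumn]])
    measurements <- measurements[, -classColumn, drop = FALSE]
    classes <- classesVector
  }
  # Defensive step: keep only the numeric columns.
  isNumeric <- vapply(measurements, is.numeric, logical(1))
  list(measurements = measurements[, isNumeric, drop = FALSE], outcomes = classes)
}

# Made-up table with two numeric features, one categorical feature and the classes.
patients <- data.frame(geneA = c(1.2, 3.4, 5.6), geneB = c(0.1, 0.2, 0.3),
                       sex = c("F", "M", "F"), class = c("Healthy", "Disease", "Healthy"))
prepared <- splitClasses(patients, "class")
colnames(prepared[["measurements"]])  # "geneA" "geneB"
levels(prepared[["outcomes"]])        # "Disease" "Healthy"
```

If classes is already a factor vector, the first branch is skipped and only the numeric filtering is applied.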
ClassifyR also accepts a matrix and a MultiAssayExperiment as input. Convenience methods are provided for these inputs which convert them into a DataFrame. In this way, only the DataFrame version of kNNinterface does the classification.
setMethod("kNNinterface", "matrix",
          function(measurementsTrain, classesTrain, measurementsTest, ...)
{
  kNNinterface(DataFrame(measurementsTrain, check.names = FALSE),
               classesTrain,
               DataFrame(measurementsTest, check.names = FALSE), ...)
})
setMethod("kNNinterface", "MultiAssayExperiment",
          function(measurementsTrain, measurementsTest, targets = names(measurementsTrain), classesTrain, ...)
{
  tablesAndClasses <- .MAEtoWideTable(measurementsTrain, targets, classesTrain)
  trainingTable <- tablesAndClasses[["dataTable"]]
  classes <- tablesAndClasses[["outcomes"]]
  testingTable <- .MAEtoWideTable(measurementsTest, targets)
  .checkVariablesAndSame(trainingTable, testingTable)
  kNNinterface(trainingTable, classes, testingTable, ...)
})
The matrix method simply casts the input matrices to a DataFrame and calls kNNinterface again, which dispatches to the method for a DataFrame that carries out the classification. As written above, the method does not transpose its inputs, so it treats the rows as samples and the columns as features. A matrix that stores features in the rows and samples in the columns (customary in bioinformatics) would need to be transposed with t() before the conversion.
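The delegation pattern, in which one method does the work and the others only convert their input, can be tried without Bioconductor installed. The sketch below is an illustration under stated assumptions: nearestClass is a hypothetical generic, a base R data.frame stands in for DataFrame, and the toy data stores samples in rows.

```r
library(methods)

# Hypothetical generic, used only to illustrate the delegation pattern.
setGeneric("nearestClass", function(measurementsTrain, ...) standardGeneric("nearestClass"))

# The data.frame method does the actual classification.
setMethod("nearestClass", "data.frame", function(measurementsTrain, classesTrain, measurementsTest, ...)
{
  class::knn(as.matrix(measurementsTrain), as.matrix(measurementsTest), classesTrain, ...)
})

# The matrix method only converts its inputs and dispatches again.
setMethod("nearestClass", "matrix", function(measurementsTrain, classesTrain, measurementsTest, ...)
{
  nearestClass(as.data.frame(measurementsTrain), classesTrain, as.data.frame(measurementsTest), ...)
})

# Toy data: two well separated groups, samples in rows.
train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
trainClasses <- factor(rep(c("A", "B"), each = 3))
test <- rbind(c(1, 1), c(9, 9))

# k is captured by ... and forwarded to knn by both methods.
fromMatrix <- nearestClass(train, trainClasses, test, k = 3)
fromTable <- nearestClass(as.data.frame(train), trainClasses, as.data.frame(test), k = 3)
```

Both calls end in the data.frame method, so any change to the classification code has to be made in one place only.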
The conversion of a MultiAssayExperiment is more complicated. ClassifyR has an internal function .MAEtoWideTable which converts a MultiAssayExperiment to a wide DataFrame. targets specifies which assays to include in the conversion. By default, it also filters the resultant table to contain only numeric variables. The internal validity function .checkVariablesAndSame checks that there is at least one column after filtering and that the training and testing tables have the same number of variables.
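The kind of validity check described above can be sketched in a few lines of base R. checkTables is a hypothetical stand-in written for this illustration; it is not the code of .checkVariablesAndSame, and its error messages are invented.

```r
# Hypothetical stand-in for the check described above.
checkTables <- function(trainingTable, testingTable)
{
  # At least one variable must remain after filtering.
  if(ncol(trainingTable) == 0)
    stop("The training table has no variables left after filtering.")
  # Both tables must have the same number of variables.
  if(ncol(trainingTable) != ncol(testingTable))
    stop("The training and testing tables have different numbers of variables.")
  invisible(TRUE)
}

passed <- checkTables(data.frame(a = 1:2, b = 3:4), data.frame(a = 5, b = 6))
failed <- tryCatch(checkTables(data.frame(a = 1:2), data.frame(a = 1, b = 2)),
                   error = function(e) conditionMessage(e))
```

Failing early with a clear message is preferable to letting the classifier fail later with a less informative error about mismatched dimensions.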
Create a data set with 10 samples and 10 features with a clear difference between the two classes. Run leave-one-out cross-validation.
classes <- factor(rep(c("Healthy", "Disease"), each = 5), levels = c("Healthy", "Disease"))
measurements <- matrix(c(rnorm(50, 10), rnorm(50, 5)), ncol = 10)
colnames(measurements) <- paste("Sample", 1:10)
rownames(measurements) <- paste("mRNA", 1:10)
library(ClassifyR)
knnParams <- ModellingParams(selectParams = NULL,
trainParams = TrainParams(kNNinterface),
predictParams = NULL)
CVparams <- CrossValParams("Leave-k-Out", leave = 1)
classified <- runTests(measurements, classes, CVparams, knnParams)
classified
## An object of class 'ClassifyResult'.
## Characteristics:
## characteristic value
## Classifier Name k Nearest Neighbours
## Cross-validation Leave 1 Out
## Features: List of length 10 of feature identifiers.
## Predictions: A data frame of 10 rows.
## Performance Measures: None calculated yet.
cbind(predictions(classified), known = actualOutcomes(classified))
## sample subset class known
## 1 mRNA 1 1 Disease Healthy
## 2 mRNA 2 2 Disease Healthy
## 3 mRNA 3 3 Healthy Healthy
## 4 mRNA 4 4 Disease Healthy
## 5 mRNA 5 5 Disease Healthy
## 6 mRNA 6 6 Disease Disease
## 7 mRNA 7 7 Healthy Disease
## 8 mRNA 8 8 Disease Disease
## 9 mRNA 9 9 Disease Disease
## 10 mRNA 10 10 Healthy Disease
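NULL is specified for predictParams instead of a PredictParams object because one function does both training and prediction. For a task this easy, the classifier would be expected to predict all samples correctly. In the output shown above, however, only 4 of the 10 predictions match the known classes and the sample column lists the mRNA row names, which suggests that the rows of the matrix, rather than its columns, were treated as samples.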
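The agreement between the predicted and known classes can be checked by comparing the two columns directly. A minimal sketch, with the values copied by hand from the class and known columns of the output printed above rather than extracted from the classified object:

```r
classLevels <- c("Healthy", "Disease")

# Values copied from the class and known columns of the printed table.
predicted <- factor(c("Disease", "Disease", "Healthy", "Disease", "Disease",
                      "Disease", "Healthy", "Disease", "Disease", "Healthy"),
                    levels = classLevels)
known <- factor(rep(classLevels, each = 5), levels = classLevels)

confusion <- table(predicted, known)  # confusion matrix
accuracy <- mean(predicted == known)  # proportion of correct predictions, 0.4 here
```

Both factors are given the same levels explicitly, because comparing factors with different level sets is an error in R.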
The argument verbose is sent from runTest to these functions, so they must accept it even if they do not explicitly use it. In the ClassifyR framework, verbose is a number which indicates how many progress messages are printed. If verbose is 0, no progress messages are printed. If it is 1, one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves.
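A function can meet this requirement simply by listing verbose among its parameters. The sketch below is a hypothetical example written for this illustration (majorityClassifier is not part of ClassifyR); it prints a progress message only at level 3 and otherwise ignores the argument.

```r
# Hypothetical classifier-style function: it always predicts the most frequent
# training class and uses verbose only to decide whether to print a message.
majorityClassifier <- function(measurementsTrain, classesTrain, measurementsTest, ..., verbose = 3)
{
  if(verbose == 3)
    message("Predicting the most frequent training class for every test sample.")
  mostFrequent <- names(which.max(table(classesTrain)))
  factor(rep(mostFrequent, nrow(measurementsTest)), levels = levels(classesTrain))
}

# Toy data with samples in rows.
train <- matrix(rnorm(12), nrow = 6)
test <- matrix(rnorm(4), nrow = 2)
trainClasses <- factor(c("A", "A", "A", "A", "B", "B"))

quiet <- majorityClassifier(train, trainClasses, test, verbose = 0)  # no message
loud <- majorityClassifier(train, trainClasses, test, verbose = 3)   # prints the message
```

Placing verbose after ... means it can only be set by name, so it cannot be confused with an option intended for the underlying classifier.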
A version of each included transformation, selection, training and prediction function is typically implemented for three kinds of input: (1) a numeric matrix in which rows are features and columns are samples (a data storage convention in bioinformatics), accompanied by a factor vector whose length equals the number of columns of the matrix, (2) a DataFrame in which columns are features, possibly of different data types (i.e. categorical and numeric), and rows are samples, accompanied by a class specification, and (3) a MultiAssayExperiment which stores sample class information in a column named “class” of the DataFrame in its colData slot. Inputs which are not a DataFrame (1 and 3) are converted to one, because they can be stored as a DataFrame without loss of information and the transformation, selection and classification functions which accept a DataFrame contain the code that does the actual computations.
At a minimum, a new function must have a method taking a DataFrame as input, with the sample classes either stored in a column named “class” or provided as a factor vector. Although not required, also providing a method that accepts a numeric matrix with an accompanying factor vector and a method that accepts a MultiAssayExperiment is desirable, because it gives users flexibility in their input data. Anyone intending to implement novel classification-related functions for use with ClassifyR can consult the code of existing functions in the package for examples.