For experiments in which analyzed samples come from different classes or conditions, a common goal of supervised analysis is classification: given a labeled training set for which classes are already known, we want to predict the class of a new sample.
Unlike unsupervised analysis such as segmentation, classification requires biological replicates for testing and validation, to avoid biased reporting of accuracy. Cardinal provides cross-validation for this purpose.
In this vignette, we present an example classification workflow using Cardinal.
We begin by loading the package:
library(Cardinal)
This example uses DESI spectra collected from a renal cell carcinoma (RCC) cancer dataset consisting of 8 matched pairs of human kidney tissue. Each tissue pair consists of a normal tissue sample and a cancerous tissue sample. The goal of the workflow is to develop classifiers for predicting whether a new tissue sample is normal or cancer.
MH0204_33 | UH0505_12 | UH0710_33 | UH9610_15 |
---|---|---|---|
UH9812_03 | UH9905_18 | UH9911_05 | UH9912_01 |
---|---|---|---|
In this RCC dataset, we expect that normal tissue and cancerous tissue will have unique chemical profiles, which we can use to classify new tissue based on the mass spectra.
First, we load the dataset from the CardinalWorkflows package using exampleMSIData()
.
rcc <- CardinalWorkflows::exampleMSIData("rcc")
The dataset contains 16,000 spectra with 10,200 m/z-values.
rcc
## MSImagingExperiment with 10200 features and 16000 spectra
## spectraData(1): intensity
## featureData(1): mz
## pixelData(4): x, y, run, diagnosis
## coord(2): x = 1...99, y = 1...38
## runNames(8): MH0204_33, UH0505_12, UH0710_33, ..., UH9905_18, UH9911_05, UH9912_01
## mass range: 150.08 to 1000.00
## centroided: FALSE
We can visualize the diagnoses:
image(rcc, "diagnosis", layout=c(2,4), free="xy", col=dpal("Set1"))
As can be seen above, each matched pair of tissues belonging to the same subject are on the same slide (i.e., the same run). Note also the the cancer tissue is on the left and the normal tissue is on the right on each slide.
First, let’s visualize the total ion current.
rcc <- summarizePixels(rcc, stat=c(TIC="sum"))
image(rcc, "TIC", layout=c(2,4), free="xy")
Clearly there is pixel-to-pixel variation in in the total ion current.
We will normalize the TIC and perform peak picking on all spectra.
rcc_peaks <- rcc |>
normalize(method="tic") |>
peakProcess(SNR=3, filterFreq=FALSE,
tolerance=0.5, units="mz")
rcc_peaks
## MSImagingExperiment with 753 features and 16000 spectra
## spectraData(1): intensity
## featureData(3): mz, count, freq
## pixelData(5): x, y, run, diagnosis, TIC
## coord(2): x = 1...99, y = 1...38
## runNames(8): MH0204_33, UH0505_12, UH0710_33, ..., UH9905_18, UH9911_05, UH9912_01
## metadata(2): processing_20241031122709, processing_20241031122712
## mass range: 151.2376 to 999.4412
## centroided: TRUE
This produces a centroided dataset with 753 peaks.
Note that we have performed peak picking on all spectra, rather than peak picking on a subset and then summarizing (as we did in the “Segmentation” workflow). This is so we have independent peaks for every spectrum that were not selected based on information from any other spectra. Therefore, we can use these peaks in cross-validation (after re-aligning them independently within each training set).
We will also create a subset of the dataset that excludes the background pixels.
rcc_nobg <- subsetPixels(rcc_peaks, !is.na(diagnosis))
rcc_nobg
## MSImagingExperiment with 753 features and 6077 spectra
## spectraData(1): intensity
## featureData(3): mz, count, freq
## pixelData(5): x, y, run, diagnosis, TIC
## coord(2): x = 2...91, y = 2...37
## runNames(8): MH0204_33, UH0505_12, UH0710_33, ..., UH9905_18, UH9911_05, UH9912_01
## metadata(2): processing_20241031122709, processing_20241031122712
## mass range: 151.2376 to 999.4412
## centroided: TRUE
Before proceeding with the statistical analysis, we’ll first perform some exploratory visual analysis of the dataset.
Let’s plot the images for m/z 810, which appears abundant in both normal and tumor tissue, and doesn’t seem to be very predictive.
image(rcc_peaks, mz=810.36, layout=c(2,4), free="xy",
smooth="bilateral", enhance="histogram", scale=TRUE)
We will also visualize m/z 855, which appears slightly more abundant in the tumor tissue (left).
image(rcc_peaks, mz=886.43, layout=c(2,4), free="xy",
smooth="bilateral", enhance="histogram", scale=TRUE)
Lastly, we will also visualize m/z 215, which appears more abundant in the normal tissue (right).
image(rcc_peaks, mz=215.24, layout=c(2,4), free="xy",
smooth="bilateral", enhance="histogram", scale=TRUE)
We can also note that runs “UH0505_12” and “UH9905_18” have very heterogenous tumor tissues, with what appear to be large sections of normal tissue.
Principal component analysis (PCA) is an unsupervised method for exploring a dataset. PCA is available in Cardinal through the PCA()
method.
Below, we calculate the first 2 principal components. Note that PCA does not use any information about the diagnosis.
rcc_pca <- PCA(rcc_nobg, ncomp=2)
We can overlay the PC scores of the first 2 principal components. It doesn’t appear that either component distinguishes the diagnoses.
image(rcc_pca, type="x", layout=c(2,4), free="xy", scale=TRUE)
We can plot the scores of the first 2 components against each other to see how they separate the diagnoses (or don’t, in this case).
plot(rcc_pca, type="x", groups=rcc_nobg$diagnosis, shape=20)
It doesn’t appear that PCA separates cancer versus normal tissue. At least, not the first 2 components.
PCA is also a useful way to visualize how much each run clusters together. A large amount of variation in the data tends to be variation between experimental runs. This is why it’s useful to have matched pairs on the same slide.
plot(rcc_pca, type="x", groups=run(rcc_nobg), shape=20)