This vignette describes the functionality implemented in the CytoMDS
package. CytoMDS
provides support for low dimensional projection of a set of cytometry samples, using concepts such as Earth Mover’s (EMD) distance, and Multi Dimensional Scaling (MDS). This vignette is distributed under a CC BY-SA 4.0 license.
CytoMDS 1.2.0
To install this package, start R and enter (un-commented):
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("CytoMDS")
We now load the packages needed in the current vignette:
suppressPackageStartupMessages(library(HDCytoData))
library(CytoMDS)
library(ggplot2)
library(patchwork)
The CytoMDS
package implements low dimensional visualization of cytometry
samples, in order to visually assess distances between them. This, in turn,
can greatly help the user to identify quality issues like batch effects
or outlier samples, and/or check the presence of potential sample clusters
that might align with the experimental design.
The CytoMDS
algorithm combines, on the one hand, the concept of
Earth Mover’s Distance (EMD) (Orlova et al. 2016), a.k.a. Wasserstein metric
and, on the other hand, the metric Multi Dimensional Scaling (MDS) algorithm
for the low dimensional projection (Leeuw and Mair 2009).
Besides projection itself, the package also provides some diagnostic tools for both checking the quality of the MDS projection, as well as interpreting the axes of the projection (see below sections).
The illustrative dataset that is used throughout this vignette is a mass cytometry (CyTOF) dataset from (Bodenmiller et al. 2012), and provided in the Bioconductor HDCytoData data package (Weber and Soneson (2019)).
This dataset consists of 16 paired samples (8 times 2) of peripheral blood cells from healthy individuals. Among each sample pair, one sample - the reference - was left un-stimulated, while the other sample was stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL).
This public dataset is known to contain a strong differential expression signal between the two conditions (stimulated vs un-stimulated) and as been used in recent work to benchmark differential analysis algorithms ((Weber et al. 2019)) and to design mass cytometry data analysis pipelines ((Nowicka et al. 2017)).
In the CytoMDS
package, as in the current vignette,
matrices of cytometry events intensities, corresponding to one sample,
are stored as flowCore::flowFrame
(Ellis et al. 2023) objects.
Samples of a particular cytometry dataset are then stored
as a flowCore::flowSet
object, which is a collection of flowFrame’s,
i.e. one flowFrame per sample. Therefore, we load the flowSet version
of the BodenMiller2012 dataset, obtained from the HDCytoData
package.
BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## loading from cache
## Warning in updateObjectFromSlots(object, ..., verbose = verbose): dropping
## slot(s) 'colnames' from object = 'flowSet'
BCRXL_fs
## A flowSet with 16 experiments.
##
## column names(39): Time Cell_length ... sample_id population_id
In regular flowSet’s, the experimental design information is typically
stored in the phenoData
slot, and this is also the way CytoMDS
expects to
get its input. However, HDCytoData
has chosen to store the experimental
design information in a slightly different way, hence the need to convert
the data as follows:
phenoData <- flowCore::pData(BCRXL_fs)
additionalPhenoData <-
keyword(BCRXL_fs[[1]], "EXPERIMENT_INFO")$EXPERIMENT_INFO
phenoData <- cbind(phenoData, additionalPhenoData)
flowCore::pData(BCRXL_fs) <- phenoData
We also select channels/markers that are biologically relevant, i.e. both the cell type and cell state markers, and store them for further use. We discard the typical housekeeping markers that are founds in flowFrames like time and Cell_length, etc. In total, these mass cytometry samples contain intensities for 24 biologically relevant markers.
markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO
chClass <- markerInfo$marker_class
table(chClass)
## chClass
## none type state
## 11 10 14
chLabels <- markerInfo$channel_name[chClass != "none"]
(chMarkers <- markerInfo$marker_name[chClass != "none"])
## [1] "CD3" "CD45" "pNFkB" "pp38" "CD4" "CD20" "CD33" "pStat5"
## [9] "CD123" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3" "CD14" "pSlp76"
## [17] "pBtk" "pPlcg2" "pErk" "pLat" "IgM" "pS6" "HLA-DR" "CD7"
The first step consists in scale transforming the raw data. Indeed, distances between samples make more sense with scaled transformed signal, in which distributional differences are more readable and usable for downstream analysis.
Here, since we are dealing with mass cytometry samples, we use the classical
arcsinh()
transformation with 5 as co-factor, as described elsewhere
((Nowicka et al. 2017)).
trans <- arcsinhTransform(
transformationId="ArcsinhTransform",
a = 0,
b = 1/5,
c = 0)
BCRXL_fs_trans <- transform(
BCRXL_fs,
transformList(chLabels, trans))
We can now calculate pairwise Earth Mover’s Distances (EMD) between all samples of our dataset.
This is done by calling the pairwiseEMDDist()
method The simplest way to
use this method is by providing directly a flowCore::flowSet
, containing all
samples, as input parameter. Note that, for heavy datasets that include
a lot of samples, this can create memory issues. To handle this case, CytoMDS
provides other ways to call the pairwiseEMDDist()
function
(see ‘Handling heavy datasets’ section).
Using the channels
argument, it is possible to restrict the EMD calculation
to some of the channels. Here we simply pass as input the biologically
relevant markers selected in the previous section.
pwDist <- pairwiseEMDDist(
BCRXL_fs_trans,
channels = chMarkers,
verbose = FALSE
)
The calculated distance is a symmetric square matrix, with as many rows (columns) as input samples (extract shown here below for the scale-transformed Bodenmiller2012 dataset).
round(pwDist[1:10, 1:10], 2)
## 1 2 3 4 5 6 7 8 9 10
## 1 0.00 10.46 4.30 11.18 6.39 12.69 7.11 11.85 5.88 10.13
## 2 10.46 0.00 10.91 3.16 11.45 6.06 13.17 10.61 9.61 6.08
## 3 4.30 10.91 0.00 10.72 7.44 13.17 7.47 12.84 5.84 10.00
## 4 11.18 3.16 10.72 0.00 12.10 5.97 12.66 9.63 10.35 6.72
## 5 6.39 11.45 7.44 12.10 0.00 9.19 4.45 10.19 5.28 9.96
## 6 12.69 6.06 13.17 5.97 9.19 0.00 10.56 7.23 11.61 7.74
## 7 7.11 13.17 7.47 12.66 4.45 10.56 0.00 7.24 8.89 11.87
## 8 11.85 10.61 12.84 9.63 10.19 7.23 7.24 0.00 14.50 11.68
## 9 5.88 9.61 5.84 10.35 5.28 11.61 8.89 14.50 0.00 7.10
## 10 10.13 6.08 10.00 6.72 9.96 7.74 11.87 11.68 7.10 0.00
One relevant way to visualize this distance matrix is to draw the histogram of pairwise distances, as shown in the below plot.
distVec <- pwDist[upper.tri(pwDist)]
distVecDF <- data.frame(dist = distVec)
pHist <- ggplot(distVecDF, mapping = aes(x=dist)) +
geom_histogram(fill = "darkgrey", col = "black", bins = 15) +
theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset")
pHist
Once the pairwise distance matrix has been calculated, computing the Multi
Dimensional Scaling (MDS) projection is done
by calling the computeMetricMDS()
function. In its simplest form, only
the distance matrix needs to be passed to the function.
In that case, the number of dimensions to use in the MDS is automatically
set in order to reach a specific value for a projection quality indicator,
i.e. the target pseudo R square, which in turn is set by default to 0.95
(see Quality of projection - diagnostic tools section).
Note that the Smacof algorithm (Leeuw and Mair 2009),
used to compute the MDS projection, is stochastic, so it is sensitive
to the ‘seed’ used. Therefore, in cases where reproducible results from
one run to another is required, it is advised to set the seed
argument
to a fixed value.
The returned value of the computeMetricMDS()
function is an object of the
MDS class. This object can be queried to get e.g. the number of dimensions
that was effectively used, or the obtained pseudo RSquare, as shown
in the following code chunk:
mdsObj <- computeMetricMDS(pwDist, seed = 0)
show(mdsObj)
## MDS object containing MDS projection (using Smacof algorithm) data:
## Nb of dimensions: 4
## Nb of points: 16
## Stress: 0.040668
## Pseudo RSquare: 0.981489
## Goodness of fit: 0.998346
Plotting the obtained MDS projection is done using ggplotSampleMDS()
.
If no phenoData
parameter is passed, then, by default,
numbers are used as labels, and samples are represented as black dots.
ggplotSampleMDS(mdsObj)