# Contents

Most of the pipeline and visualizations presented herein have been adapted from Chevrier et al. (2018)’s “Compensation of Signal Spillover in Suspension and Imaging Mass Cytometry” available here.

# load required packages
library(CATALYST)
library(cowplot)
library(flowCore)
library(ggplot2)
library(SingleCellExperiment)

# 1 Data examples

• Normalization:
raw_data is a flowSet with 2 experiments, each containing 2’500 raw measurements with a variation of signal over time. Samples were mixed with DVS beads captured by mass channels 140, 151, 153, 165 and 175.
• Debarocoding:
To demonstrate the debarcoding workflow with CATALYST, we provide sample_ff which follows a 6-choose-3 barcoding scheme where mass channels 102, 104, 105, 106, 108, and 110 were used for labeling such that each of the 20 individual barcodes are positive for exactly 3 out of the 6 barcode channels. Accompanying this, sample_key contains a binary code of length 6 for each sample, e.g. 111000, as its unique identifier.
• Compensation:
Alongside the multiplexed-stained cell sample mp_cells, the package contains 36 single-antibody stained controls in ss_exp where beads were stained with antibodies captured by mass channels 139, 141 through 156, and 158 through 176, respectively, and pooled together. Note that, to decrease running time, we downsampled to a total of 10’000 events. Lastly, isotope_list contains a named list of isotopic compositions for all elements within 75 through 209 u corresponding to the CyTOF mass range at the time of writing (Coursey et al. 2015).

# 2 Data organization

Data used and returned throughout preprocessing are organized into an object of the SingleCellExperiment (SCE) class. A SCE can be constructed from a directory housing a single or set of FCS files, a character vector of the file(s), flowFrame(s) or a flowSet (from the flowCore package) using CATALYST’s prepData function.

prepData will automatically identify channels not corresponding to masses (e.g., event times), remove them from the output SCE’s assay data, and store them as internal event metadata (int_colData).

When multiple files or frames are supplied, prepData will concatenate the data into a single object, and argument by_time (default TRUE) specifies whether runs should be ordered by their acquisition time (keyword(x, "$BTIM"), where x is a flowFrame or flowSet). A "sample_id" column will be added to the output SCE’s colData to track which file/frame events originally source from. Finally, when transform (default TRUE), an arcsinh-transformation with cofactor cofactor (defaults to 5) is applied to the input (count) data, and the resulting expression matrix is stored in the "exprs" assay slot of the output SCE. data("raw_data") (sce <- prepData(raw_data)) ## class: SingleCellExperiment ## dim: 61 5000 ## metadata(1): experiment_info ## assays(2): counts exprs ## rownames(61): BC1 Vol1 ... Pb208Di BC9 ## rowData names(4): channel_name marker_name marker_class use_channel ## colnames: NULL ## colData names(1): sample_id ## reducedDimNames(0): ## altExpNames(0): # view number of events per sample table(sce$sample_id)
##
## raw_data_1.fcs raw_data_2.fcs
##           2500           2500
# view non-mass channels
names(int_colData(sce))
## [1] "reducedDims"  "altExps"      "colPairs"     "Time"         "Event_length"
## [6] "Center"       "Offset"       "Width"        "Residual"

# 3 Normalization

CATALYST provides an implementation of bead-based normalization as described by Finck et al. (Finck et al. 2013). Here, identification of bead-singlets (used for normalization), as well as of bead-bead and cell-bead doublets (to be removed) is automated as follows:

1. beads are identified as events with their top signals in the bead channels
2. cell-bead doublets are remove by applying a separation cutoff to the distance between the lowest bead and highest non-bead signal
4. bead-bead doublets are removed by applying a default $$median\;\pm5\;mad$$ rule to events identified in step 2. The remaining bead events are used for normalization.

## 3.1 Normalization workflow

### 3.1.1normCytof: Normalization using bead standards

Since bead gating is automated here, normalization comes down to a single function that takes a SingleCellExperiment as input and only requires specification of the beads to be used for normalization. Valid options are:

• "dvs" for bead masses 140, 151, 153, 165, 175
• "beta" for bead masses 139, 141, 159, 169, 175
• or a custom numeric vector of bead masses

By default, we apply a $$median\;\pm5\;mad$$ rule to remove low- and high-signal events from the bead population used for estimating normalization factors. The extent to which bead populations are trimmed can be adjusted via trim. The population will become increasingly narrow and bead-bead doublets will be exluded as the trim value decreases. Notably, slight over-trimming will not affect normalization. It is therefore recommended to choose a trim value that is small enough to assure removal of doublets at the cost of a small bead population to normalize to.

normCytof will return the following list of SCE(s)…

• data: Input dataset including normalized counts (and expressions, if transform = TRUE).
• if remove_beads = FALSE, colData columns "is_bead" and "remove" indicate whether an event has been marker as a bead or for removal, respectively.
• otherwise, bead and doublet events are excluded and the following additional data is returned:
• beads: Subset of identified bead events.
• removed: Subset of all cells that have been from the original dataset, including bead events as well as bead-bead and bead-cell doublets.

…and ggplot-objects:

• scatter: Scatter plot of bead vs. DNA intensities with indication of applied gates.
• lines: Running-median smoothed bead intensities vs. time before and after normalization.

Besides general normalized parameters (beads specifying the normalization beads, and running median windown width k), normCytof requires as input to assays corresponding to count- and expression-like data respectively. Here, correction factors are computed on the linear (count) scale, while automated bead-identification happens on the transformed (expression) scale.

By default, normCytof will overwrite the specified assays with the normalized data (overwrite = TRUE). In order to retain both unnormalized and normalized data, overwrite should be set to FALSE, in which case normalized counts (and expression, when transform = TRUE) will be written to separate assay normcounts/exprs, respectively.

# construct SCE
sce <- prepData(raw_data)
# apply normalization; keep raw data
res <- normCytof(sce, beads = "dvs", k = 50,
assays = c("counts", "exprs"), overwrite = FALSE)
# check number & percentage of bead / removed events
n <- ncol(sce); ns <- c(ncol(res$beads), ncol(res$removed))
data.frame(
check.names = FALSE,
"#" = c(ns[1], ns[2]),
"%" = 100*c(ns[1]/n, ns[2]/n),
row.names = c("beads", "removed"))
##           #    %
## removed 153 3.06
# extract data excluding beads & doublets,
# and including normalized intensitied
sce <- res$data assayNames(sce) ## [1] "counts" "exprs" "normcounts" "normexprs" # plot bead vs. dna scatters res$scatter

# plot smoothed bead intensities
res$lines # 4 Debarcoding CATALYST provides an implementation of the single-cell deconvolution algorithm described by Zunder et al. (Zunder et al. 2015). The package contains three functions for debarcoding and three visualizations that guide selection of thresholds and give a sense of barcode assignment quality. In summary, events are assigned to a sample when i) their positive and negative barcode populations are separated by a distance larger than a threshold value and ii) the combination of their positive barcode channels appears in the barcoding scheme. Depending on the supplied scheme, there are two possible ways of arriving at preliminary event assignments: 1. Doublet-filtering: Given a binary barcoding scheme with a coherent number $$k$$ of positive channels for all IDs, the $$k$$ highest channels are considered positive and $$n-k$$ channels negative. Separation of positive and negative events equates to the difference between the $$k$$th highest and $$(n-k)$$th lowest intensity value. If a numeric vector of masses is supplied, the barcoding scheme will be an identity matrix; the most intense channel is considered positive and its respective mass assigned as ID. 2. Non-constant number of 1’s: Given a non-uniform number of 1’s in the binary codes, the highest separation between consecutive barcodes is looked at. In both, the doublet-filtering and the latter case, each event is assigned a binary code that, if matched with a code in the barcoding scheme supplied, dictates which row name will be assigned as ID. Cells whose positive barcodes are still very low or whose binary pattern of positive and negative barcodes doesn’t occur in the barcoding scheme will be given ID 0 for “unassigned”. All data required for debarcoding are held in objects of the SingleCellExperiment (SCE) class, allowing for the following easy-to-use workflow: 1. as the initial step of single-cell deconcolution, assignPrelim will return a SCE containing the input measurement data, barcoding scheme, and preliminary event assignments. 2. assignments will be made final by applyCutoffs. It is recommended to estimate, and possibly adjust, population-specific separation cutoffs by running estCutoffs prior to this. 3. plotYields, plotEvents and plotMahal aim to guide selection of devoncolution parameters and to give a sense of the resulting barcode assignment quality. ## 4.1 Debarcoding workflow ### 4.1.1assignPrelim: Assignment of preliminary IDs The debarcoding process commences by assigning each event a preliminary barcode ID. assignPrelim thereby takes either a binary barcoding scheme or a vector of numeric masses as input, and accordingly assigns each event the appropirate row name or mass as ID. FCS files are read into R with read.FCS of the flowCore package, and are represented as an object of class flowFrame: data(sample_ff) sample_ff ## flowFrame object 'anonymous' ## with 20000 cells and 6 observables: ## name desc range minRange maxRange ## 1 (Pd102)Di BC102 9745.80 -0.999912 9745.80 ## 2 (Pd104)Di BC104 9687.52 -0.999470 9687.52 ## 3 (Pd105)Di BC105 8924.64 -0.998927 8924.64 ## 4 (Pd106)Di BC106 8016.67 -0.999782 8016.67 ## 5 (Pd108)Di BC108 9043.87 -0.999997 9043.87 ## 6 (Pd110)Di BC110 8204.46 -0.999937 8204.46 ## 0 keywords are stored in the 'description' slot The debarcoding scheme should be a binary table with sample IDs as row and numeric barcode masses as column names: data(sample_key) head(sample_key) ## 102 104 105 106 108 110 ## A1 1 1 1 0 0 0 ## A2 1 1 0 1 0 0 ## A3 1 1 0 0 1 0 ## A4 1 1 0 0 0 1 ## A5 1 0 1 1 0 0 ## B1 1 0 1 0 1 0 Provided with a SingleCellExperiment and a compatible barcoding scheme (barcode masses must occur as parameters in the supplied SCE), assignPrelim will add the following data to the input SCE: - assay slot "scaled" containing normalized expression values where each population is scaled to the 95%-quantile of events assigend to the respective population. - colData columns "bc_id" and "delta" containing barcode IDs and separations between lowest positive and highest negative intensity (on the normalized scale) - rowData column is_bc specifying, for each channel, whether it has been specified as a barcode channel sce <- prepData(sample_ff) (sce <- assignPrelim(sce, sample_key)) ## Debarcoding data... ## o ordering ## o classifying events ## Normalizing... ## Computing deltas... ## class: SingleCellExperiment ## dim: 6 20000 ## metadata(2): experiment_info bc_key ## assays(3): counts exprs scaled ## rownames(6): BC102 BC104 ... BC108 BC110 ## rowData names(5): channel_name marker_name marker_class use_channel ## is_bc ## colnames: NULL ## colData names(3): sample_id bc_id delta ## reducedDimNames(0): ## altExpNames(0): # view barcode channels rownames(sce)[rowData(sce)$is_bc]
## [1] "BC102" "BC104" "BC105" "BC106" "BC108" "BC110"
# view number of events assigned to each barcode population
table(sce\$bc_id)
##
##   A1   A2   A3   A4   A5   B1   B2   B3   B4   B5   C1   C2   C3   C4   C5   D1
## 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
##   D2   D3   D4   D5
## 1000 1000 1000 1000

### 4.1.2estCutoffs: Estimation of separation cutoffs

As opposed to a single global cutoff, estCutoffs will estimate a sample-specific cutoff to deal with barcode population cell yields that decline in an asynchronous fashion. Thus, the choice of thresholds for the distance between negative and positive barcode populations can be i) automated and ii) independent for each barcode. Nevertheless, reviewing the yield plots (see below), checking and possibly refining separation cutoffs is advisable.

For the estimation of cutoff parameters we consider yields upon debarcoding as a function of the applied cutoffs. Commonly, this function will be characterized by an initial weak decline, where doublets are excluded, and subsequent rapid decline in yields to zero. Inbetween, low numbers of counts with intermediate barcode separation give rise to a plateau. To facilitate robust estimation, we fit a linear and a three-parameter log-logistic function (Finney 1971) to the yields function with the LL.3 function of the drc R package (Ritz et al. 2015) (Figure 1). As an adequate cutoff estimate, we target a point that marks the end of the plateau regime and on-set of yield decline to appropriately balance confidence in barcode assignment and cell yield.

The goodness of the linear fit relative to the log-logistic fit is weighed as follow: $w = \frac{\text{RSS}_{log-logistic}}{\text{RSS}_{log-logistic}+\text{RSS}_{linear}}$

The cutoffs for both functions are defined as:

$c_{linear} = -\frac{\beta_0}{2\beta_1}$ $c_{log-logistic}=\underset{x}{\arg\min}\:\frac{\vert\:f'(x)\:\vert}{f(x)} > 0.1$

The final cutoff estimate $$c$$ is defined as the weighted mean between these estimates:

$c=(1-w)\cdot c_{log-logistic}+w\cdot c_{linear}$