scp 1.12.0
This vignette demonstrates how to use scp
to report missing values,
following our recommendations in Vanderaa and Gatto (2023). Briefly, we
recommend reporting at least 4 metrics:
We will also demonstrate how to estimate total sensitivity when the number of samples is too low and how to report data consistency using the distribution of the Jaccard indices.
In this vignette, we will assume you are familiar with the scp
framework. If this is not the case, we suggest you first read the
introduction vignette.
First, we load the scp
package and retrieve a real-life dataset from
the scpdata
package.
library("scp")
library("scpdata")
leduc <- leduc2022()
Next, we reduce the size of the dataset to the 30 first acquisitions. This allows for a fast execution of the code for this vignette while still being a representative demonstration on a real dataset. We also keep only the feature annotations that will be used later in the vignette.
leduc <- leduc[, , 1:30]
#> Warning: 'experiments' dropped; see 'drops()'
#> harmonizing input:
#> removing 8057 sampleMap rows not in names(experiments)
#> removing 1872 colData rownames not in sampleMap 'primary'
leduc <- selectRowData(leduc, c(
"Sequence", "Leading.razor.protein", "Reverse",
"Potential.contaminant", "PEP"
))
This is the actual minimal processing: 1. filtering contaminant and low-quality features 2. replacing zeros by missing values 3. keep only samples that correspond to single cells 4. remove the feature absent in all samples.
leduc <- filterFeatures(leduc, ~ Reverse != "+" &
Potential.contaminant != "+" &
PEP < 0.01) ## 1.
#> 'Reverse' found in 30 out of 30 assay(s)
#> 'Potential.contaminant' found in 30 out of 30 assay(s)
#> 'PEP' found in 30 out of 30 assay(s)
leduc <- zeroIsNA(leduc, i = names(leduc)) ## 2.
leduc <- filterNA(leduc, i = names(leduc), pNA = 0.9999)
leduc <- subsetByColData( ## 3.
leduc, leduc$SampleType %in% c("Monocyte", "Melanoma cell")
)
leduc <- dropEmptyAssays(leduc) ## 4.
Next, we build the peptide identification matrix, that is a matrix
with single cells as columns and peptides as rows that indicate
whether a given peptide is observed or not in a given cell. To build
this matrix, we need data at peptide level, hence we aggregate the
peptide to spectrum match (PSM) data to peptide data. To do so, we
need an aggregation function, isIdentified()
. This functions
considers that a peptide is observed in a cell (returns a TRUE
) if
there is at least one observed PSM belonging to the peptide in that
cell.
isIdentified <- function(x) colSums(!is.na(x)) != 0
leduc <- aggregateFeatures(
leduc, i = names(leduc),
name = paste0("id_", names(leduc)),
fcol = "Sequence",
fun = isIdentified
)
What about proteins? if we were interested in reporting missing
values at the protein level, we simply need to change
fcol = "Sequence"
to fcol = "Leading.razor.protein"
.
Then, we join all runs in a single large assay.
leduc <- joinAssays(
leduc, i = grep("^id_", names(leduc)),
name = "id"
)
Since not all peptides are found in all assays, some entries have been
filled with NA
s. These entries are missing and hence are replaced
with FALSE
.
isMissing <- is.na(assay(leduc[["id"]]))
assay(leduc[["id"]])[isMissing] <- FALSE
We can now compute the metrics of interest. We recommend computing
these for each cell type separately, since biological properties
specific to the cell type could influence the outcome. You can perform
this using reportMissingValues()
. We provide the dataset and point
towards the assay with the identification matrix (id
). The metrics
are computed based on the cell annotation SampleType
that is
available in the colData
.
reportMissingValues(leduc, "id", by = leduc$SampleType)
#> LocalSensitivityMean LocalSensitivitySd TotalSensitivity Completeness
#> Monocyte 2753.904 351.2502 7029 0.3872737
#> NumberCells
#> Monocyte 197
The Jaccard index between a pair of cells is the number of features
shared by the two cells divided by the number of features identified
in any of the two columns. This provides a good measure of how
consistent the identifications are across single-cells. Again,
biological differences between cell types may decrease the consistency
between single cells and we therefore suggest to compute the Jaccard
index for each cell type separately. We compute the Jaccard index
using jaccardIndex()
.
ji <- jaccardIndex(leduc, "id", by = leduc$SampleType)
The function returns a data.frame
that we visualize using the
ggplot2
package.
library("ggplot2")
ggplot(ji) +
aes(x = jaccard) +
geom_histogram() +
facet_grid(~ by)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The Jaccard indices are mainly distributed between 40 and 60 %, meaning that about half of the features are consistently found across single-cells within the same cell type. Note also that some pairs of cells have consistency above 80 %. These are pairs of cells from the same acquisition runs that were multiplexed together with TMT labelling.
To assessing whether we can accurately estimate the total sensitivity,
we generate a cumulative sensitivity curve (CSC). More precisely, we
sample the identification matrix for an increasing number of cells
(or runs) and count the number of distinct features found across the
sampled cells. We repeat each sampling multiple times to account for
the stochasticity of the approach. The approach is implemented in
cumulativeSensitivityCurve()
. Again, we compute the curve for each
cell type separately. In the leduc
dataset, several cells are
acquired in an MS run. When a features is identified in a cell, it is
most of the time also identified in all other cells of that run, and
this will distort the cumulative sensitivity curve. Therefore, the
function provides a batch
argument to account for this. Finally,
nSteps
defines the number of random draws with increasing sample
size, and niters
defines how many times each draw must be iterated.
csc <- cumulativeSensitivityCurve(leduc, "id", by = leduc$SampleType,
batch = leduc$Set, niters = 10,
nsteps = 30)
The function returns a data.frame
that we visualize using the
ggplot2
package.
(plCSC <- ggplot(csc) +
aes(x = SampleSize, y = Sensitivity, colour = by) +
geom_point(size = 1))
The cumulative sensitivity does not reach a plateau. This means
that we underestimated the total sensitivity in the previous section.
We use predictSensitivity()
to predict the total sensitivity from
these curves. The function fits an asymptotic regression model to
assess the relationship between the sensitivity and the sample size.
Then, it uses the model to predict the sensitivity for any sample
size. (supplied through nSamples
). The function requires the
data.frame
generated by cumulativeSensitivityCurve()
. To assess
the quality of the fit, we first predict the sensitivity for the range
of sample size.
predCSC <- predictSensitivity(csc, nSample = 1:30)
plCSC + geom_line(data = predCSC)
We finally predict the total sensitivity, that is the sensitivity if we had an infinite number of samples.
predictSensitivity(csc, nSamples = Inf)
#> SampleSize Sensitivity by
#> 1 Inf 7160.515 Monocyte
The total sensitivity is predicted to be about 7200 peptides for both cell types. So in the previous section, we underestimated the total sensitivity by about 150-200 peptides.
citation("scp")
#> To cite the scp package in publications use:
#>
#> Vanderaa, Christophe, and Laurent Gatto. 2023. Revisiting the Thorny
#> Issue of Missing Values in Single-Cell Proteomics. Journal of
#> Proteome Research 22 (9): 2775–84.
#>
#> Vanderaa Christophe and Laurent Gatto. The current state of
#> single-cell proteomics data analysis. Current Protocols 3 (1): e658.;
#> doi: https://doi.org/10.1002/cpz1.658 (2023).
#>
#> Vanderaa Christophe and Laurent Gatto. Replication of Single-Cell
#> Proteomics Data Reveals Important Computational Challenges. Expert
#> Review of Proteomics, 1–9 (2021).
#>
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.
This vignette is distributed under a CC BY-SA license license.
Vanderaa, Christophe, and Laurent Gatto. 2023. “Revisiting the Thorny Issue of Missing Values in Single-Cell Proteomics,” April. http://arxiv.org/abs/2304.06654.