library(PRONE)
PRONE also offers some additional functionalities for the evaluation of normalization techniques on spike-in data sets with a known ground truth. These functionalities will be delineated subsequently. Additionally, all functionalities detailed in the context of real-world data sets remain applicable to the SummarizedExperiment object associated with spike-in data sets.
The example spike-in data set is from (Cox et al. 2014). For the illustration of the package functionality, a smaller subset of the data consisting of 1500 proteins was used. For the complete data set, please refer to the original publication.
data_path <- readPRONE_example("Ecoli_human_MaxLFQ_protein_intensities.csv")
md_path <- readPRONE_example("Ecoli_human_MaxLFQ_metadata.csv")
data <- read.csv(data_path)
md <- read.csv(md_path)
Before loading the data into a SummarizedExperiment object, it is important to check the organism of the protein groups. Proteins should be uniquely assigned to either the spike-in organism or the background organism. If some protein groups are mixed, they should be removed from the data since these can’t be used for classification into true positives, false positives, etc.
In our example, we can extract the information from the “Fasta.headers” column:
# Check if some protein groups are mixed
mixed <- grepl("Homo sapiens.*Escherichia|Escherichia.*Homo sapiens",data$Fasta.headers)
data <- data[!mixed,]
table(mixed)
#> mixed
#> FALSE
#> 1000
In this case, all proteins were assigned either as a spike-in or background protein. Hence, a SummarizedExperiment container of the data can be created. However, before we need to add a column of the actual organism that can be used to calculate true positives, false positives, etc.
data$Spiked <- rep("HUMAN", nrow(data))
data$Spiked[grepl("ECOLI", data$Fasta.headers)] <- "ECOLI"
In contrast to the real-world data sets, you need to specify the “spike_column”, “spike_value”, “spike_concentration”, and utilize the load_spike_data()
function for this purpose.
Here, the “spike_column” denotes the column in the protein intensity data file encompassing information whether a proteins is classified as spike-in or background protein. The “spike_value” determines the value/identifier that is used to identify the spike-in proteins in the “spike_column”, and the “spike_concentration” identifies the column containing the concentration values of the spike-in proteins (in this case the different conditions that will be tested in DE analysis).
se <- load_spike_data(data,
md,
spike_column = "Spiked",
spike_value = "ECOLI",
spike_concentration = "Concentration",
protein_column = "Protein.IDs",
gene_column = "Gene.names",
ref_samples = NULL,
batch_column = NULL,
condition_column = "Condition",
label_column = "Label")
To get an overview on the number of identified (non-NA) spike-in proteins per sample, you can use the plot_identified_spiked_proteins()
function.
plot_identified_spiked_proteins(se)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
To compare the distributions of spike-in and human proteins in the different sample groups (here high-low), use the function plot_histogram_spiked()
. Again, “condition = NULL” means that the condition specified by loading the data is used, but you can also specify any other column of the meta data.
plot_histogram_spiked(se, condition = NULL)
#> Condition of SummarizedExperiment used!
If you want to have a look at the amount of actual measure spike-in, you can use the plot_profiles_spiked()
function. Moreover, you can analyze whether the intensities of the background proteins, here HUMAN proteins, are constant across the different spike-in concentrations.
plot_profiles_spiked(se, xlab = "Concentration")
Given that the preprocessing (including filtering of proteins and samples), normalization, and imputation operations remain invariant for spike-in data sets compared to real-world data sets, the same methodologies can be employed across both types. In this context, normalization will only be demonstrated here as the performance of the methods will be evaluated in DE analysis, while detailed descriptions of the other functionalities are available in preceding sections.
se_norm <- normalize_se(se, c("Median", "Mean", "MAD", "LoessF"),
combination_pattern = NULL)
#> Median completed.
#> Mean completed.
#> MAD completed.
#> LoessF completed.
Due to the known spike-in concentrations, the normalization methods can be evaluated based on their ability to detect DE proteins. The DE analysis can be conducted using the same methodology as for real-world data sets. However, other visualization options are available and performance metrics can be calculated for spike-in data sets.
First, you need to specify the comparisons you want to perform in DE analysis. For this, a special function was developed which helps to build the right comparison strings.
comparisons <- specify_comparisons(se_norm, condition = "Condition",
sep = NULL, control = NULL)
Then you can run DE analysis:
de_res <- run_DE(se = se_norm,
comparisons = comparisons,
ain = NULL,
condition = NULL,
DE_method = "limma",
covariate = NULL,
logFC = TRUE,
logFC_up = 1,
logFC_down = -1,
p_adj = TRUE,
alpha = 0.05,
B = 100,
K = 500)
#> Condition of SummarizedExperiment used!
#> All assays of the SummarizedExperiment will be used.
#> DE Analysis will not be performed on raw data.
#> log2: DE analysis completed.
#> Median: DE analysis completed.
#> Mean: DE analysis completed.
#> MAD: DE analysis completed.
#> LoessF: DE analysis completed.
Before being able to visualize the DE results, you need to run get_spiked_stats_DE()
to calculate multiple performance metrics, such as the number of true positives, number fo false positives, etc.
stats <- get_spiked_stats_DE(se_norm, de_res)
# Show tp, fp, fn, tn, with F1 score
knitr::kable(stats[,c("Assay", "Comparison", "TP", "FP", "FN", "TN", "F1Score")], caption = "Performance metrics for the different normalization methods and pairwise comparisons.", digits = 2)
Assay | Comparison | TP | FP | FN | TN | F1Score |
---|---|---|---|---|---|---|
LoessF | H-L | 170 | 27 | 92 | 711 | 0.74 |
MAD | H-L | 0 | 1 | 262 | 737 | 0.00 |
Mean | H-L | 141 | 65 | 121 | 673 | 0.60 |
Median | H-L | 160 | 46 | 102 | 692 | 0.68 |
log2 | H-L | 163 | 40 | 99 | 698 | 0.70 |
You have different options to visualize the amount of true and false positives for the different normalization techniques and pairwise comparisons. For all these functions, you can specify a subset of normalization methods and comparisons. If you do not specify anything, all normalization methods and comparisons of the “stats” data frame are used.
The plot_TP_FP_spiked_bar()
function generates a barplot showing the number of false positives and true positives for each normalization method and is facetted by pairwise comparison.
plot_TP_FP_spiked_bar(stats, ain = c("Median", "Mean", "MAD", "LoessF"),
comparisons = NULL)
#> All comparisons of stats will be visualized.
If many pairwise comparisons were performed, the plot_TP_FP_spiked_box()
function can be used to visualize the distribution of true and false positives for all pairwise comparisons in a boxplot.
Given that the data set encompasses merely two distinct spike-in concentrations and therefore only one pairwise comparison has been conducted in DE analysis, a barplot would be more appropriate. Nonetheless, for demonstration purposes, a boxplot will be used here.
plot_TP_FP_spiked_box(stats, ain = c("Median", "Mean", "MAD", "LoessF"),
comparisons = NULL)
#> All comparisons of stats will be visualized.
Furthermore, similarly the plot_TP_FP_spiked_scatter()
function can be used to visualize the true and false positives in a scatter plot. Here, a scatterplot of the median true positives and false positives is calculated across all comparisons and displayed for each normalization method with errorbars showing the Q1 and Q3. Here again, this plot is more suitable for data sets with multiple pairwise comparisons.
plot_TP_FP_spiked_scatter(stats, ain = NULL, comparisons = NULL)
#> All comparisons of stats will be visualized.
#> All normalization methods of de_res will be visualized.
Furthermore, other performance metrics that the number of true positives and false positives can be visualized using the plot_stats_spiked_heatmap()
function. Here, currently, the sensitivity, specificity, precision, FPR, F1 score, and accuracy can be visualized for the different normalization methods and pairwise comparisons.
plot_stats_spiked_heatmap(stats, ain = c("Median", "Mean", "MAD"),
metrics = c("Precision", "F1Score"))
#> All comparisons of stats will be visualized.
Finally, receiver operating characteristic (ROC) curves and AUC values can be calculated and visualized using the plot_ROC_AUC_spiked()
function. This function returns a plot showing the ROC curves, a bar plot with the AUC values for each comparison, and a boxplot with the distributions of AUC values across all comparisons.
plot_ROC_AUC_spiked(se_norm, de_res, ain = c("Median", "Mean", "LoessF"),
comparisons = NULL)
#> All comparisons of de_res will be visualized.
#> $ROC
#>
#> $AUC_bars
#>
#> $AUC_box
#>
#> $AUC_dt
#> PANEL Comparison group Assay AUC
#> 1 1 H-L 1 LoessF 0.8784788
#> 2 1 H-L 2 Mean 0.7851968
#> 3 1 H-L 3 Median 0.8792709
Furthermore, you can visualize the distribution of log fold changes for the different conditions using the plot_fold_changes_spiked()
function. The fold changes of the background proteins should be centered around zero, while the spike-in proteins should be centered around the actual log fold change calculated based on the spike-in concentrations.
plot_fold_changes_spiked(se_norm, de_res, condition = "Condition",
ain = c("Median", "Mean", "MAD"), comparisons = NULL)
#> All comparisons of de_res will be visualized.
Similarly, the distributions of the p-values can be visualized using the plot_pvalues_spiked()
function.
plot_pvalues_spiked(se_norm, de_res, ain = c("Median", "Mean", "MAD"),
comparisons = NULL)
#> All comparisons of de_res will be visualized.
Due to the high amount of false positives encountered in spike-in data sets, we provided a function to test for different log fold change thresholds to try to reduce the amount of false positives. The function plot_logFC_thresholds_spiked()
can be used to visualize the number of true positives and false positives for different log fold change thresholds.
plots <- plot_logFC_thresholds_spiked(se_norm, de_res, condition = NULL,
ain = c("Median", "Mean", "MAD"),
nrow = 1, alpha = 0.05)
#> All comparisons of de_res will be visualized.
#> Condition of SummarizedExperiment used!
plots[[1]]
plots[[2]]
utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] PRONE_1.0.0 SummarizedExperiment_1.36.0
#> [3] Biobase_2.66.0 GenomicRanges_1.58.0
#> [5] GenomeInfoDb_1.42.0 IRanges_2.40.0
#> [7] S4Vectors_0.44.0 BiocGenerics_0.52.0
#> [9] MatrixGenerics_1.18.0 matrixStats_1.4.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] shape_1.4.6.1 MultiAssayExperiment_1.32.0
#> [5] magrittr_2.0.3 magick_2.8.5
#> [7] farver_2.1.2 MALDIquant_1.22.3
#> [9] rmarkdown_2.28 GlobalOptions_0.1.2
#> [11] zlibbioc_1.52.0 vctrs_0.6.5
#> [13] Cairo_1.6-2 htmltools_0.5.8.1
#> [15] S4Arrays_1.6.0 ComplexUpset_1.3.3
#> [17] SparseArray_1.6.0 mzID_1.44.0
#> [19] sass_0.4.9 bslib_0.8.0
#> [21] htmlwidgets_1.6.4 plyr_1.8.9
#> [23] impute_1.80.0 cachem_1.1.0
#> [25] igraph_2.1.1 lifecycle_1.0.4
#> [27] iterators_1.0.14 pkgconfig_2.0.3
#> [29] Matrix_1.7-1 R6_2.5.1
#> [31] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [33] clue_0.3-65 digest_0.6.37
#> [35] pcaMethods_1.98.0 colorspace_2.1-1
#> [37] patchwork_1.3.0 crosstalk_1.2.1
#> [39] vegan_2.6-8 labeling_0.4.3
#> [41] fansi_1.0.6 httr_1.4.7
#> [43] abind_1.4-8 mgcv_1.9-1
#> [45] compiler_4.4.1 withr_3.0.2
#> [47] doParallel_1.0.17 BiocParallel_1.40.0
#> [49] UpSetR_1.4.0 highr_0.11
#> [51] MASS_7.3-61 DelayedArray_0.32.0
#> [53] rjson_0.2.23 permute_0.9-7
#> [55] mzR_2.40.0 tools_4.4.1
#> [57] PSMatch_1.10.0 glue_1.8.0
#> [59] nlme_3.1-166 QFeatures_1.16.0
#> [61] gridtext_0.1.5 grid_4.4.1
#> [63] cluster_2.1.6 reshape2_1.4.4
#> [65] generics_0.1.3 gtable_0.3.6
#> [67] preprocessCore_1.68.0 tidyr_1.3.1
#> [69] data.table_1.16.2 xml2_1.3.6
#> [71] utf8_1.2.4 XVector_0.46.0
#> [73] foreach_1.5.2 pillar_1.9.0
#> [75] stringr_1.5.1 limma_3.62.0
#> [77] splines_4.4.1 circlize_0.4.16
#> [79] dplyr_1.1.4 ggtext_0.1.2
#> [81] lattice_0.22-6 tidyselect_1.2.1
#> [83] ComplexHeatmap_2.22.0 knitr_1.48
#> [85] gridExtra_2.3 bookdown_0.41
#> [87] ProtGenerics_1.38.0 xfun_0.48
#> [89] statmod_1.5.0 plotROC_2.3.1
#> [91] MSnbase_2.32.0 DT_0.33
#> [93] stringi_1.8.4 UCSC.utils_1.2.0
#> [95] lazyeval_0.2.2 yaml_2.3.10
#> [97] NormalyzerDE_1.24.0 evaluate_1.0.1
#> [99] codetools_0.2-20 MsCoreUtils_1.18.0
#> [101] tibble_3.2.1 BiocManager_1.30.25
#> [103] cli_3.6.3 affyio_1.76.0
#> [105] munsell_0.5.1 jquerylib_0.1.4
#> [107] Rcpp_1.0.13 png_0.1-8
#> [109] XML_3.99-0.17 parallel_4.4.1
#> [111] ggplot2_3.5.1 dendsort_0.3.4
#> [113] AnnotationFilter_1.30.0 scales_1.3.0
#> [115] affy_1.84.0 ncdf4_1.23
#> [117] purrr_1.0.2 crayon_1.5.3
#> [119] POMA_1.16.0 GetoptLong_1.0.5
#> [121] rlang_1.1.4 vsn_3.74.0
Cox, Jürgen, Marco Y. Hein, Christian A. Luber, Igor Paron, Nagarjuna Nagaraj, and Matthias Mann. 2014. “Accurate Proteome-Wide Label-Free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed Maxlfq.” Molecular & Cellular Proteomics 13 (9): 2513–26. https://doi.org/10.1074/mcp.M113.031591.