Pirat 1.0.1
Pirat is an R
package available via the
Bioconductor repository for packages.
You can install the development version from GitHub :
BiocManager::install("prostarproteomics/Pirat")
When calling one of the imputation methods for the first time after package installation, the underlying python module and conda environment will be automatically downloaded and installed without the need for user intervention. This may take several minutes, but this is a one-time installation. the first time after package installation.
Pirat is a single imputation pipeline including a novel statistical approach dedicated to bottom-up proteomics data. All the technical details and the validation procedure of this method in available on the corresponding preprint Etourneau et al. (2023) . To demonstrate its usage, we first load Pirat package and a subset of Bouyssie2020 dataset in the environment.
library(Pirat)
library(utils)
data(subbouyssie)
Note that subbouyssie
is actually a list that contains two elements:
peptides_ab
, the matrix of peptide log-2-abundances, with samples in rows
and peptides in columns.adj
, the adjacency matrix between peptides and proteins, containing either
booleans or 0s and 1s (no preprocessing or simplification is needed for this
matrix, Pirat will automatically build the PGs from it).Slight imputation variations may occur for peptides belonging to very large PGs, because the latter are randomly split into several smaller PGs with fixed size to reduce computational costs. Although this variability is too small to affect imputation quality, we fix the seed in this tutorial such that the user can retrieve exactly the same imputed values when running the notebook again.
set.seed(12345)
One can then impute this dataset with the following line
imp.res <- my_pipeline_llkimpute(subbouyssie)
#>
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.64780 -0.25838 0.06931 0.35373 0.76279
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 13.1950 1.5687 8.411 1.19e-07 ***
#> m_ab_sorted[seq(length(probs))] -0.7937 0.0758 -10.471 4.38e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.42 on 18 degrees of freedom
#> Multiple R-squared: 0.859, Adjusted R-squared: 0.8511
#> F-statistic: 109.6 on 1 and 18 DF, p-value: 4.378e-09
The first plot represents the goodness of fit of the inverse-gamma prior, whereas the second one represents the goodness of fit of the missingness mechanism (details on fitting methods are given in Etourneau et al. (2023)). Note that some of these parameters were originally proposed in Chen, Prentice, and Wang (2014), however no methods existed to find then automatically and without relying on heuristics.
The result imp.res
is a list that contains:
data.imputed
, the imputed log-2 abundance matrix.params
, a list containing parameters \(\Gamma\) and hyperparameters
\(\alpha\) and \(\beta\).You can check the imputed values here…
head(imp.res$data.imputed[ ,seq(5)])
#> ELISMAK EDLLVTK VVIEPHR HAGVYIAR
#> raw_abundance_10amol_1 24.07467 25.08683 24.79494 25.90383
#> raw_abundance_10amol_2 24.05896 25.11780 25.04157 26.20225
#> raw_abundance_10amol_3 23.94210 24.97989 24.67742 25.97934
#> raw_abundance_10amol_4 24.00094 24.93289 24.31135 25.80639
#> raw_abundance_50amol_1 24.05962 25.03968 24.66224 25.86291
#> raw_abundance_50amol_2 24.02732 25.02199 24.65812 25.84511
#> DHCIVVGR Carbamidomethyl (C3)
#> raw_abundance_10amol_1 24.77011
#> raw_abundance_10amol_2 24.98623
#> raw_abundance_10amol_3 24.69103
#> raw_abundance_10amol_4 24.50651
#> raw_abundance_50amol_1 24.68406
#> raw_abundance_50amol_2 24.66596
…and the computed parameters here.
imp.res$params
#> $alpha
#> a
#> 5.147265
#>
#> $beta
#> b
#> 0.2335229
#>
#> $gamma0
#> (Intercept)
#> -13.19496
#>
#> $gamma1
#> m_ab_sorted[seq(length(probs))]
#> 0.7937527
Note that a positive value for \(\gamma1\) indicates that a random left-truncation mechanism is present in the dataset.
Pirat has a diagnosis tool that compares distributions of correlations at random and those from same peptide groups (PGs). We use it here on the complete Ropers2021 dataset.
data(subropers)
plot_pep_correlations(subropers, titlename = "Ropers2021")
We see here that the blue distribution has much more weights on high correlations than the red one, indicating that PGs should clearly help for imputation.
To handle singleton PGs, Pirat proposes three extensions, on top of the classical Pirat approach. Note that the -T extensions can be applied to up to an arbitrary PG size. To illustrate our examples, we use a subset of Ropers2021 dataset.
The -2 extension simply consists in not imputing singleton PGs. It can be used as following.
data(subropers)
imp.res = pipeline_llkimpute(subropers, extension = "2")
#>
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.40672 -0.15008 0.08833 0.28986 0.84125
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.40677 1.19049 12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419 0.06469 -14.13 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared: 0.803, Adjusted R-squared: 0.799
#> F-statistic: 199.7 on 1 and 49 DF, p-value: < 2.2e-16
We can then check that singleton peptides are not imputed (yet some may be already fully observed).
mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#> NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS 20.17349 NA 19.18247
#> W1_2_MS 19.79753 17.22690 18.50307
#> W1_3_MS 20.07592 17.55627 18.83505
#> W2_1_MS 19.82019 17.42375 18.47821
#> W2_2_MS 20.17537 17.13305 17.89070
#> W2_3_MS 20.36344 17.50759 18.18041
#> R1_1_MS 20.69017 NA 18.12682
#> R1_2_MS 20.70866 NA 18.03316
#> R1_3_MS 21.04542 NA 18.52355
#> R2_1_MS 20.95692 NA 18.49510
#> R2_2_MS 20.90775 15.87217 17.92845
#> R2_3_MS 21.20864 NA 18.24418
#> R3_1_MS 21.02669 NA NA
#> R3_2_MS 21.10948 NA 17.34890
#> R3_3_MS 21.23109 NA NA
#> R4_1_MS 20.83249 NA 17.28032
#> R4_2_MS 21.00524 NA 17.67013
#> R4_3_MS 20.57447 NA 17.33383
Pirat can leverage sample-wise correlations to impute the singleton peptides as following:
imp.res = my_pipeline_llkimpute(subropers, extension = "S")
#>
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.40672 -0.15008 0.08833 0.28986 0.84125
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.40677 1.19049 12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419 0.06469 -14.13 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared: 0.803, Adjusted R-squared: 0.799
#> F-statistic: 199.7 on 1 and 49 DF, p-value: < 2.2e-16
Here singleton peptides are impute after the rest of the dataset, using sample-wise correlations obtained.
mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#> NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS 20.17349 17.77321 19.18247
#> W1_2_MS 19.79753 17.22690 18.50307
#> W1_3_MS 20.07592 17.55627 18.83505
#> W2_1_MS 19.82019 17.42375 18.47821
#> W2_2_MS 20.17537 17.13305 17.89070
#> W2_3_MS 20.36344 17.50759 18.18041
#> R1_1_MS 20.69017 16.01870 18.12682
#> R1_2_MS 20.70866 16.28692 18.03316
#> R1_3_MS 21.04542 16.34702 18.52355
#> R2_1_MS 20.95692 16.10669 18.49510
#> R2_2_MS 20.90775 15.87217 17.92845
#> R2_3_MS 21.20864 16.43660 18.24418
#> R3_1_MS 21.02669 15.87970 17.69229
#> R3_2_MS 21.10948 16.04076 17.34890
#> R3_3_MS 21.23109 16.01596 18.00233
#> R4_1_MS 20.83249 16.22859 17.28032
#> R4_2_MS 21.00524 16.03777 17.67013
#> R4_3_MS 20.57447 15.87691 17.33383
The last extension consists in using correlations between peptides and gene/transcript expression obtained from a related transcriptomic analysis. To use this extension, the list of the dataset must contain:
rnas_ab
, an log2-normalized-count table of gene or transcript expression,
for which samples are either paired or related (i.e., from the same
experimental/biological conditions).adj_rna_pg
, a adjacency matrix between transcripts or genes and PGs,
containing either booleans or 0s and 1s.ropers
proteomic and transcriptomic samples are paired (i.e. the same
biological samples were used for each type of analysis). Thus Pirat-T can be
used as following:
imp.res = my_pipeline_llkimpute(subropers,
extension = "T",
rna.cond.mask = seq(nrow(subropers$peptides_ab)),
pep.cond.mask = seq(nrow(subropers$peptides_ab)),
max.pg.size.pirat.t = 1)
#>
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.40672 -0.15008 0.08833 0.28986 0.84125
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.40677 1.19049 12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419 0.06469 -14.13 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared: 0.803, Adjusted R-squared: 0.799
#> F-statistic: 199.7 on 1 and 49 DF, p-value: < 2.2e-16
Only few peptides have been used to fit the prior variance distribution in \(\Sigma\), as we use a small subset from the original Ropers2021 dataset. Thus the goodness of fit may vary a lot depending on the subset chosen.
It gives the following imputed singletons:
mask.sing.pg = colSums(subropers$adj) == 1
mask.sing.pep = rowSums(subropers$adj[, mask.sing.pg]) >= 1
imp.res$data.imputed[, mask.sing.pep]
#> NALPEGYAPAK__2 AIGIISNNPEFADVR__2 DLNIDPATLR__2
#> W1_1_MS 20.17349 17.16826 19.18247
#> W1_2_MS 19.79753 17.22690 18.50307
#> W1_3_MS 20.07592 17.55627 18.83505
#> W2_1_MS 19.82019 17.42375 18.47821
#> W2_2_MS 20.17537 17.13305 17.89070
#> W2_3_MS 20.36344 17.50759 18.18041
#> R1_1_MS 20.69017 16.07105 18.12682
#> R1_2_MS 20.70866 15.69253 18.03316
#> R1_3_MS 21.04542 15.95376 18.52355
#> R2_1_MS 20.95692 15.74790 18.49510
#> R2_2_MS 20.90775 15.87217 17.92845
#> R2_3_MS 21.20864 15.54095 18.24418
#> R3_1_MS 21.02669 15.70025 17.84194
#> R3_2_MS 21.10948 15.83525 17.34890
#> R3_3_MS 21.23109 15.81318 17.82150
#> R4_1_MS 20.83249 15.99150 17.28032
#> R4_2_MS 21.00524 16.14081 17.67013
#> R4_3_MS 20.57447 17.69132 17.33383
On the other hand, if proteomic and transcriptomic samples are not paired but are derived from a same biological/experimental condition. Pirat-T can be used by adapting the mask related to samples in each type of analysis (here, both proteomic and transcriptomic datasets have 6 different conditions in the same order with 3 replicates each):
imp.res = my_pipeline_llkimpute(subropers,
extension = "T",
rna.cond.mask = rep(seq(6), each = 3),
pep.cond.mask = rep(seq(6), each = 3),
max.pg.size.pirat.t = 1)
#>
#> Call:
#> stats::lm(formula = log(probs) ~ m_ab_sorted[seq(length(probs))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.40672 -0.15008 0.08833 0.28986 0.84125
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14.40677 1.19049 12.10 2.48e-16 ***
#> m_ab_sorted[seq(length(probs))] -0.91419 0.06469 -14.13 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4549 on 49 degrees of freedom
#> Multiple R-squared: 0.803, Adjusted R-squared: 0.799
#> F-statistic: 199.7 on 1 and 49 DF, p-value: < 2.2e-16
Also, it is possible to apply transcriptomic integration up to an arbitrary
size of PG, simply by
changing parameter max.pg.size.pirat.t
in my_pipeline_llkimpute()
to the
desired limit PG size (e.g. +Inf
for whole dataset).
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/.cache/R/basilisk/1.18.0/Pirat/1.0.1/envPirat/lib/libmkl_intel_lp64.so.1; LAPACK version 3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] Pirat_1.0.1 BiocManager_1.30.25 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.36.0 gtable_0.3.6
#> [3] dir.expiry_1.14.0 xfun_0.49
#> [5] bslib_0.8.0 ggplot2_3.5.1
#> [7] Biobase_2.66.0 lattice_0.22-6
#> [9] vctrs_0.6.5 tools_4.4.2
#> [11] generics_0.1.3 parallel_4.4.2
#> [13] stats4_4.4.2 tibble_3.2.1
#> [15] fansi_1.0.6 pkgconfig_2.0.3
#> [17] Matrix_1.7-1 S4Vectors_0.44.0
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [21] farver_2.1.2 compiler_4.4.2
#> [23] progress_1.2.3 tinytex_0.54
#> [25] munsell_0.5.1 GenomeInfoDb_1.42.0
#> [27] htmltools_0.5.8.1 sass_0.4.9
#> [29] yaml_2.3.10 pillar_1.9.0
#> [31] crayon_1.5.3 jquerylib_0.1.4
#> [33] MASS_7.3-61 DelayedArray_0.32.0
#> [35] cachem_1.1.0 magick_2.8.5
#> [37] abind_1.4-8 basilisk_1.18.0
#> [39] tidyselect_1.2.1 digest_0.6.37
#> [41] dplyr_1.1.4 bookdown_0.41
#> [43] labeling_0.4.3 fastmap_1.2.0
#> [45] grid_4.4.2 colorspace_2.1-1
#> [47] cli_3.6.3 invgamma_1.1
#> [49] SparseArray_1.6.0 magrittr_2.0.3
#> [51] S4Arrays_1.6.0 utf8_1.2.4
#> [53] withr_3.0.2 prettyunits_1.2.0
#> [55] filelock_1.0.3 scales_1.3.0
#> [57] UCSC.utils_1.2.0 rmarkdown_2.29
#> [59] XVector_0.46.0 httr_1.4.7
#> [61] matrixStats_1.4.1 reticulate_1.40.0
#> [63] png_0.1-8 hms_1.1.3
#> [65] evaluate_1.0.1 knitr_1.49
#> [67] GenomicRanges_1.58.0 IRanges_2.40.0
#> [69] basilisk.utils_1.18.0 rlang_1.1.4
#> [71] Rcpp_1.0.13-1 glue_1.8.0
#> [73] BiocGenerics_0.52.0 jsonlite_1.8.9
#> [75] R6_2.5.1 MatrixGenerics_1.18.0
#> [77] zlibbioc_1.52.0
Chen, Lin S., Ross L. Prentice, and Pei Wang. 2014. “A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation.” Biometrics 70 (2): 312–22. https://doi.org/10.1111/BIOM.12149.
Etourneau, Lucas, Laura Fancello, Samuel Wieczorek, Nelle Varoquaux, and Thomas Burger. 2023. “A new take on missing value imputation for bottom-up label-free LC-MS/MS proteomics.” bioRxiv, January, 2023.11.09.566355. https://doi.org/10.1101/2023.11.09.566355.