The function of this R package is to assess the contribution of the targeted precursor in a fragmentation isolation window using a metric called “precursor purity”.
What we call “precursor purity” is a measure of the contribution of a selected precursor peak in an isolation window used for fragmentation. The simple calculation involves dividing the intensity of the selected precursor peak by the total intensity of the isolation window. When assessing MS/MS spectra, this calculation is done for the full scans before and after the MS/MS scan of interest, and the purity is interpolated at the time of the MS/MS acquisition. The calculation is very similar to the “Precursor Ion Fraction” (PIF) metric described by Michalski, Cox, and Mann (2011) for proteomics, with the exception that purity here is interpolated at the recorded point of MS2 acquisition using bordering full-scan spectra. Additionally, low-abundance ions that are thought to have limited contribution to the resulting MS2 spectra can be removed, and the isolation efficiency of the mass spectrometer can optionally be taken into account.
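As a minimal sketch of the calculation described above (not the package's internal code), the purity of a single full scan can be computed in base R and then linearly interpolated to the MS/MS acquisition time. The function name window_purity, the peak values and the scan times are hypothetical and chosen only for illustration.

```r
# Purity of a target peak within the isolation window of one MS1 scan.
# mz, intensity: peaks of the scan; target_mz: selected precursor m/z;
# offsets: the window extends offsets[1] below and offsets[2] above the target.
window_purity <- function(mz, intensity, target_mz, offsets = c(0.5, 0.5)) {
  in_window <- mz >= target_mz - offsets[1] & mz <= target_mz + offsets[2]
  if (!any(in_window)) return(NA_real_)
  win_mz <- mz[in_window]
  win_int <- intensity[in_window]
  # The peak closest to the target m/z is treated as the precursor
  target_idx <- which.min(abs(win_mz - target_mz))
  win_int[target_idx] / sum(win_int)
}

# Hypothetical full scans acquired before and after an MS/MS scan
scan_before <- data.frame(mz = c(149.02, 149.30, 150.10), i = c(800, 200, 500))
scan_after  <- data.frame(mz = c(149.02, 149.30, 150.10), i = c(600, 400, 500))

p_before <- window_purity(scan_before$mz, scan_before$i, target_mz = 149.02)
p_after  <- window_purity(scan_after$mz,  scan_after$i,  target_mz = 149.02)

# Linear interpolation of the purity at the MS/MS acquisition time (seconds)
rt <- c(10, 12)   # retention times of the two full scans
msms_rt <- 11     # retention time of the MS/MS scan
interpolated <- approx(x = rt, y = c(p_before, p_after), xout = msms_rt)$y

print(c(before = p_before, after = p_after, interpolated = interpolated))
```

In this toy case the peak at m/z 150.10 lies outside the window, so the purity is 0.8 before and 0.6 after the MS/MS scan, and the interpolated value at the midpoint is 0.7.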
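There are two main use cases for the package:
Assessing the precursor purity of previously acquired MS/MS spectra: LC-MS/MS or DI-MS/MS spectra have been acquired and the precursor purity of each fragmentation scan is to be assessed (purityA).
Assessing the precursor purity of anticipated isolation windows: a user has acquired either LC-MS (purityX) or DIMS (purityD) full scan (MS1) data and an assessment is to be made of the precursor purity of detected features using anticipated or theoretical isolation windows. This information can then be used to guide further targeted MS2 experiments.
The package has been developed to be used with DI-MS or LC-MS data and has been checked to work with the following vendor files after conversion to mzML: Thermo, Agilent and AB Sciex.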
Given a vector of LC-MS/MS or DI-MS/MS mzML file paths, the function purityA will calculate the precursor purity of each MS/MS scan. The output is an S4 class object in which a dataframe of the purity results can be accessed using the appropriate slot (@puritydf).
The isolation widths will be determined automatically from the mzML file. For some mzML files this is not recorded and in these cases the offsets can be given as a parameter.
In the case of Agilent, only the “narrow” isolation is supported. This roughly equates to +/- 0.65 Da (depending on the instrument). If the file is detected as originating from an Agilent instrument, the isolation widths will automatically be set to +/- 0.65 Da (this can be overridden with the offsets argument).
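To illustrate how offsets translate into an isolation window, the following base R sketch derives the window boundaries from a precursor m/z. The helper isolation_window is hypothetical and not part of msPurity; the 0.65 Da value is the approximate Agilent “narrow” width quoted above.

```r
# Hypothetical helper: lower and upper m/z bounds of an isolation window.
# offsets[1] is subtracted from, and offsets[2] added to, the precursor m/z.
isolation_window <- function(precursor_mz, offsets) {
  stopifnot(length(offsets) == 2, all(offsets >= 0))
  c(lower = precursor_mz - offsets[1], upper = precursor_mz + offsets[2])
}

# Approximate Agilent "narrow" isolation (+/- 0.65 Da)
agilent_narrow <- isolation_window(391.2838, offsets = c(0.65, 0.65))

# A user-supplied asymmetric window
custom <- isolation_window(391.2838, offsets = c(0.5, 1.0))

print(rbind(agilent_narrow, custom))
```

With msPurity itself, the override would be passed through the offsets argument, presumably in the same two-element form that purityX uses later in this document, for example purityA(msmsPths, offsets = c(0.5, 0.5)).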
The purity dataframe (pa@puritydf) consists of the following columns:
library(msPurity)
## Loading required package: Rcpp
msmsPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "MSMS")
msPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "LCMS_")
pa <- purityA(msmsPths)
print(head(pa@puritydf))
## pid fileid seqNum precursorIntensity precursorMZ precursorRT
## 1 1 1 7 2338044.2 391.2838 2.707016
## 2 2 1 8 1415939.8 149.0232 2.707016
## 3 3 1 9 1319700.2 135.1015 2.707016
## 4 4 1 10 1179373.9 219.1742 2.707016
## 5 5 1 11 1065425.9 136.0200 2.707016
## 6 6 1 13 817673.7 235.1690 3.583746
## precursorScanNum id filename precursorNearest aMz aPurity
## 1 6 7 LCMSMS_1.mzML 6 391.2838 1.0000000
## 2 6 8 LCMSMS_1.mzML 6 149.0233 0.8535700
## 3 6 9 LCMSMS_1.mzML 6 135.1015 0.7616688
## 4 6 10 LCMSMS_1.mzML 12 219.1742 0.7173636
## 5 6 11 LCMSMS_1.mzML 12 136.0215 0.8163521
## 6 12 13 LCMSMS_1.mzML 12 235.1691 0.8312278
## apkNm iMz iPurity ipkNm inPkNm inPurity
## 1 1 391.2838 1.0000000 1 1 1.0000000
## 2 2 149.0233 0.8535700 2 2 0.8475240
## 3 4 135.1015 0.7616688 4 4 0.7558731
## 4 3 219.1742 0.7173636 3 3 0.7248489
## 5 4 136.0215 0.8163521 4 3 0.8247355
## 6 2 235.1691 0.8312278 2 2 0.8299369
The MS/MS spectra can be assigned to an XCMS grouped feature using the frag4feature function.
First, an xcmsSet object of the same files is required.
library(xcms)
xset <- xcms::xcmsSet(msmsPths)
## 150:92 200:211 250:340 300:477 350:580 400:642 450:692 500:729
## 150:93 200:210 250:341 300:479 350:579 400:645 450:692 500:730
xset <- xcms::group(xset)
## 166 229 291 354 416 479
xset <- xcms::retcor(xset)
## Retention Time Correction Groups: 351
xset <- xcms::group(xset)
## 166 229 291 354 416 479
pa <- frag4feature(pa, xset)
The slot grped_df is a dataframe of the grouped XCMS features, each linked to any associated MS/MS scans that fall within the full width of the XCMS feature in each file. The dataframe contains the following columns.
print(head(pa@grped_df))
## grpid mz mzmin mzmax rt rtmin rtmax into
## 108 8 112.0508 112.0507 112.0872 67.60929 55.27690 80.36167 36223791
## 109 8 112.0509 112.0506 112.1205 67.51574 55.41402 80.55541 36139266
## 16 12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 17 12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
## 46 12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 47 12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
## intf maxo maxf i sn sample id filename
## 108 133522504 7158012 8555976 1 21.05495 2 398 LCMSMS_2.mzML
## 109 133395721 7426336 8522973 1 20.38040 1 9 LCMSMS_1.mzML
## 16 491415400 27555850 32138223 1 24.64411 1 13 LCMSMS_1.mzML
## 17 465322433 26501960 30360236 1 24.07917 2 402 LCMSMS_2.mzML
## 46 491415400 27555850 32138223 1 24.64411 1 13 LCMSMS_1.mzML
## 47 465322433 26501960 30360236 1 24.07917 2 402 LCMSMS_2.mzML
## rtminCorrected rtmaxCorrected precurMtchID precurMtchRT precurMtchMZ
## 108 55.37223 80.35088 466 64.42730 112.0507
## 109 55.39383 80.40870 472 65.37616 112.0507
## 16 35.55508 60.13504 277 41.09841 116.0708
## 17 35.81265 59.93620 277 40.95480 116.0708
## 46 35.55508 60.13504 343 49.31567 116.0708
## 47 35.81265 59.93620 343 49.18240 116.0708
## precurMtchPPM inPurity pid
## 108 0.5516288 1.0000000 1213
## 109 1.0047960 1.0000000 389
## 16 2.2246382 0.9893762 226
## 17 1.7359354 0.9506322 1055
## 46 2.2903689 1.0000000 281
## 47 1.6702047 1.0000000 1110
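The linking of MS/MS scans to grouped features can be pictured with the following base R sketch. It is an illustration under stated assumptions rather than the logic of frag4feature itself: a scan is linked when its precursor retention time lies between the feature's rtmin and rtmax and its precursor m/z is within a ppm tolerance of the feature m/z. The function match_scans, the 5 ppm tolerance and the example values are hypothetical.

```r
# Hypothetical peak table (one row per XCMS feature in a file)
features <- data.frame(
  grpid = c(8, 12),
  mz    = c(112.0508, 116.0706),
  rtmin = c(55.3, 35.7),
  rtmax = c(80.4, 60.0)
)

# Hypothetical MS/MS scans with their precursor m/z and retention time
scans <- data.frame(
  pid         = c(1, 2, 3),
  precursorMZ = c(112.0507, 116.0708, 116.0708),
  precursorRT = c(64.4, 41.1, 90.0)
)

# Link scans to features by retention time range and ppm tolerance
match_scans <- function(features, scans, ppm = 5) {
  pairs <- merge(features, scans, by = NULL)  # all feature/scan combinations
  pairs$ppm_diff <- abs(pairs$precursorMZ - pairs$mz) / pairs$mz * 1e6
  keep <- pairs$precursorRT >= pairs$rtmin &
    pairs$precursorRT <= pairs$rtmax &
    pairs$ppm_diff <= ppm
  pairs[keep, c("grpid", "pid", "precursorRT", "precursorMZ", "ppm_diff")]
}

linked <- match_scans(features, scans)
print(linked)
```

Here scan 3 is not linked to any feature because its retention time falls outside both features, while scans 1 and 2 are linked to groups 8 and 12 respectively.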
The slot grped_ms2 is a list of the associated fragmentation spectra for the grouped features.
print(pa@grped_ms2[2:3])
## $`12`
## $`12`[[1]]
## [,1] [,2]
## [1,] 107.2701 1726.613
## [2,] 116.0164 2890.495
## [3,] 116.0709 100876.133
## [4,] 116.1072 2424.613
##
## $`12`[[2]]
## [,1] [,2]
## [1,] 116.0168 3725.937
## [2,] 116.0709 97631.586
## [3,] 116.1071 3945.327
##
## $`12`[[3]]
## [,1] [,2]
## [1,] 116.0709 1847703
##
## $`12`[[4]]
## [,1] [,2]
## [1,] 103.1290 4419.712
## [2,] 116.0164 5682.144
## [3,] 116.0709 1782171.000
## [4,] 130.0276 4081.138
##
## $`12`[[5]]
## [,1] [,2]
## [1,] 116.0166 4434.369
## [2,] 116.0709 165623.641
## [3,] 116.1073 11372.488
##
## $`12`[[6]]
## [,1] [,2]
## [1,] 116.0168 14364.784
## [2,] 116.0709 149471.266
## [3,] 116.1074 8359.903
##
##
## $`27`
## $`27`[[1]]
## [,1] [,2]
## [1,] 117.8772 5004.664
## [2,] 132.1019 273406.250
##
## $`27`[[2]]
## [,1] [,2]
## [1,] 132.1020 402822.69
## [2,] 144.2789 7715.69
## [3,] 146.6829 7014.51
##
## $`27`[[3]]
## [,1] [,2]
## [1,] 130.187 121726.4
## [2,] 132.102 3973065.5
##
## $`27`[[4]]
## [,1] [,2]
## [1,] 104.9648 111113.7
## [2,] 132.1021 3328366.8
##
## $`27`[[5]]
## [,1] [,2]
## [1,] 132.1021 77799.47
##
## $`27`[[6]]
## [,1] [,2]
## [1,] 115.4372 2424.58
## [2,] 132.1020 67118.03
NOTE ON TERMINOLOGY: The terms ‘anticipated purity’ and ‘predicted purity’ are used interchangeably.
A processed xcmsSet object is required to determine the anticipated (predicted) precursor purity score from an LC-MS dataset. The offsets chosen in the parameters should reflect what settings would be used in a hypothetical fragmentation experiment.
The slot predictions provides the anticipated (predicted) purity scores for each feature. The dataframe contains the following columns:
XCMS run on an LC-MS dataset
xset <- xcms::xcmsSet(msPths)
## 150:157 200:348 250:631 300:960 350:1303 400:1616 450:1900 500:2163
## 150:157 200:348 250:629 300:955 350:1296 400:1622 450:1915 500:2184
xset <- xcms::group(xset)
## 164 227 289 352 414 477
xset <- xcms::retcor(xset)
## Retention Time Correction Groups: 762
xset <- xcms::group(xset)
## 164 227 289 352 414 477
Perform purity calculations
ppLCMS <- purityX(xset, offsets=c(0.5, 0.5), xgroups = c(1, 2))
print(head(ppLCMS@predictions))
## grpid mean median sd stde RSD pknm
## 1 1 0.9901505 0.9901505 0.001354984 0.0009581183 0.1368463 2.75
## 2 2 1.0000000 1.0000000 0.000000000 0.0000000000 0.0000000 1.00
## i mz
## 1 61925043 102.0916
## 2 25719001 103.0544
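The summary columns in predictions can be related to one another with a short base R sketch. It assumes that stde is the standard deviation divided by the square root of the number of scans, and that RSD is the standard deviation expressed as a percentage of the mean; the RSD definition is consistent with the printed values, since 100 × 0.001354984 / 0.9901505 ≈ 0.1368. The per-scan purity values and the helper summarise_purity are hypothetical.

```r
# Hypothetical per-scan purity values for one feature
purity <- c(0.98, 0.99, 1.00, 0.97)

# Summarise per-scan purities into the kind of columns shown above
summarise_purity <- function(p) {
  s <- sd(p)
  data.frame(
    mean   = mean(p),
    median = median(p),
    sd     = s,
    stde   = s / sqrt(length(p)),  # standard error of the mean (assumed definition)
    RSD    = 100 * s / mean(p)     # relative standard deviation in percent
  )
}

stats <- summarise_purity(purity)
print(stats)
```

A low RSD indicates that the anticipated purity of a feature is stable across the scans in which it elutes, which is useful when deciding whether a single purity value is representative.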
The anticipated/predicted purity calculation can be performed on any DI-MS dataset consisting of multiple MS1 scans of the same mass range; it has not been developed for use with SIM stitching approaches.
A number of simple data processing steps are performed on the mzML files to provide a DI-MS peak list (features) to perform the purity predictions on.
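These data processing steps consist of averaging the spectra (with noise filtering), filtering by RSD and intensity, and subtracting the blank.
The averaged peaks before and after filtering are stored in the avPeaks slot of the purityD S4 object.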
Get file dataframe: The purityD constructor requires a dataframe describing the input files. Here it is generated with Getfiles.
datapth <- system.file("extdata", "dims", "mzML", package="msPurityData")
inDF <- Getfiles(datapth, pattern=".mzML", check = FALSE)
ppDIMS <- purityD(inDF, mzML=TRUE)
Average spectra: The default averaging uses a hierarchical clustering approach. Noise filtering is also performed here.
ppDIMS <- averageSpectra(ppDIMS, snMeth = "median", snthr = 5)
Filter by RSD and Intensity
ppDIMS <- filterp(ppDIMS, thr=5000, rsd = 10)
Subtract blank
ppDIMS <- subtract(ppDIMS)
Predict purity
ppDIMS <- dimsPredictPurity(ppDIMS)
print(head(ppDIMS@avPeaks$processed$B02_Daph_TEST_pos))
## peakID mz i snr rsd inorm
## 5 5 173.0806 11272447.0 216.506319 9.006126 0.0108585920
## 7 7 179.1177 606983.2 11.425825 6.019861 0.0005729283
## 10 10 217.1067 17770220.0 343.292914 8.602331 0.0171178067
## 15 15 235.1173 4950841.5 95.991762 6.302825 0.0047694791
## 16 16 236.1206 486912.0 9.270517 8.811437 0.0004638254
## 17 17 239.1485 2533134.5 48.892062 5.781277 0.0024401334
## medianPurity meanPurity sdPurity cvPurity sdePurity medianPeakNum
## 5 1.0000000 1.0000000 0.00000000 0.000000 0.000000000 1
## 7 1.0000000 1.0000000 0.00000000 0.000000 0.000000000 1
## 10 0.7797864 0.7808917 0.01261501 1.615462 0.005641605 2
## 15 1.0000000 1.0000000 0.00000000 0.000000 0.000000000 1
## 16 0.8818313 0.8755873 0.01056807 1.206969 0.004726184 2
## 17 0.8123950 0.8229505 0.04384595 5.327896 0.019608505 2
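The intensity and RSD filter can be sketched in base R as follows. This is an illustration, not the implementation of filterp: it assumes that peaks are kept when their averaged intensity is at or above thr and their RSD (in percent) is at or below rsd. The peak table and the helper filter_peaks are hypothetical; the thresholds mirror the filterp call above.

```r
# Hypothetical averaged peak list: m/z, mean intensity and RSD (%) across scans
peaks <- data.frame(
  mz  = c(100.1, 150.2, 200.3, 250.4),
  i   = c(11000000, 600000, 3200, 17000000),
  rsd = c(9.0, 6.0, 4.5, 25.3)
)

# Keep peaks that are both intense enough and reproducible enough
filter_peaks <- function(peaks, thr = 5000, rsd = 10) {
  keep <- peaks$i >= thr & peaks$rsd <= rsd
  peaks[keep, , drop = FALSE]
}

filtered <- filter_peaks(peaks, thr = 5000, rsd = 10)
print(filtered)
```

In this toy table the third peak is dropped for low intensity and the fourth for poor reproducibility, leaving the first two peaks for purity prediction.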
The data processing steps carried out through purityD can be bypassed if the peaks (m/z values) of interest are already known. The function dimsPredictPuritySingle() can be used to predict the purity of a list of m/z values in a chosen mzML file.
mzpth <- system.file("extdata", "dims", "mzML", "B02_Daph_TEST_pos.mzML", package="msPurityData")
predicted <- dimsPredictPuritySingle(filepth = mzpth, mztargets = c(111.0436, 113.1069))
print(predicted)
## medianPurity meanPurity sdPurity cvPurity sdePurity medianPeakNum
## 1 0.6390276 0.6251787 0.0356821 5.707505 0.01595752 5
## 2 0.7453778 0.7619277 0.1008513 13.236338 0.04510209 5
Michalski, Annette, Juergen Cox, and Matthias Mann. 2011. “More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS.” Journal of Proteome Research 10 (4): 1785–93. doi:10.1021/pr101060v.