This vignette describes the fragmentation analysis functionality of the topdownr
package.
topdownr 1.29.0
topdownr is free and open-source software. If you use it, please support the project by citing it in publications:
P.V. Shliaha, S. Gibb, V. Gorshkov, M.S. Jespersen, G.R. Andersen, D. Bailey, J. Schwartz, S. Eliuk, V. Schwämmle, and O.N. Jensen. 2018. Maximizing Sequence Coverage in Top-Down Proteomics By Automated Multi-modal Gas-phase Protein Fragmentation. Analytical Chemistry. DOI: 10.1021/acs.analchem.8b02344
For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/sgibb/topdownr/issues), providing as much information as possible, a reproducible example and the output of sessionInfo().
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
topdownr
Load the package.
library("topdownr")
Some example files are provided in the topdownrdata package. For a full analysis you need a .fasta file with the protein sequence, the .experiments.csv files containing the method information, the .txt files containing the scan header information and the .mzML files with the deconvoluted spectra.
## list.files(topdownrdata::topDownDataPath("myoglobin"))
$csv
[1] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_1.experiments.csv.gz"
[2] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_2.experiments.csv.gz"
[3] "..."
$fasta
[1] ".../20170629_myo/fasta/myoglobin.fasta.gz"
[2] "..."
$mzML
[1] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_1.mzML.gz"
[2] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_2.mzML.gz"
[3] "..."
$txt
[1] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_1.txt.gz"
[2] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_2.txt.gz"
[3] "..."
All these files have to be in one directory. You can import them via readTopDownFiles. The most important arguments of this function are the path of the directory containing the files, the protein modifications (e.g. initiator methionine removal, "Met-loss"), and the adducts (e.g. after an ETD reaction a proton is often transferred from the c to the z fragment).
## the mass adduct for a proton
H <- 1.0078250321
myoglobin <- readTopDownFiles(
    ## directory path
    path = topdownrdata::topDownDataPath("myoglobin"),
    ## fragmentation types
    type = c("a", "b", "c", "x", "y", "z"),
    ## adducts: add -H to c and +H to z fragments and name
    ## them cmH/zpH (c minus H, z plus H)
    adducts = data.frame(
        mass=c(-H, H),
        to=c("c", "z"),
        name=c("cmH", "zpH")),
    ## initiator methionine removal
    modifications = "Met-loss",
    ## don't use neutral loss
    neutralLoss = NULL,
    ## tolerance for fragment matching
    tolerance = 5e-6,
    ## topdownrdata was generated with an older version of topdownr,
    ## the method files were generated with FilterString identification,
    ## use `conditions = "ScanDescription"` (default) for recent data.
    conditions = "FilterString"
)
## Warning in FUN(X[[i]], ...): 61 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 63 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 53 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 55 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
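The adducts argument above can be read as a small lookup table: each row names a fragment type and the mass shift that is added to fragments of that type. The following base R sketch illustrates only that idea. It is not how topdownr computes fragments internally, and the two unmodified fragment masses are rounded, illustrative values chosen to agree with the cmH1 and zpH1 masses printed later in this vignette.

```r
## adduct table as passed to readTopDownFiles above
H <- 1.0078250321
adducts <- data.frame(
    mass = c(-H, H),
    to = c("c", "z"),
    name = c("cmH", "zpH"))

## illustrative (assumed) masses of unmodified fragments
fragments <- data.frame(
    type = c("b", "c", "z"),
    mass = c(58.0287, 75.0553, 59.0128))

## look up the fragment each adduct applies to and shift its mass
idx <- match(adducts$to, fragments$type)
shifted <- data.frame(
    name = adducts$name,
    type = adducts$to,
    mass = fragments$mass[idx] + adducts$mass)
print(shifted)
```

The shifted fragments keep the type of the fragment they derive from, which matches the rowViews output where cmH1 is listed with type c and zpH1 with type z.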
myoglobin
## TopDownSet object (7.28 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1216x5882 (5.15% != 0)
## Number of matched fragments: 368296
## Intensity range: [87.61;10704001.00]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
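The processing information reports the tolerance as 5 ppm, which is the relative value 5e-6 passed to readTopDownFiles. As a rough sketch of what a relative tolerance means, the base R code below assigns observed masses to the closest theoretical mass if the relative error is small enough. The helper matchPpm and the observed values are hypothetical, and the actual matching in topdownr may differ (for example in how it handles multiple candidates per peak).

```r
## relative tolerance (5e-6 = 5 ppm)
tolerance <- 5e-6

## theoretical masses, rounded values taken from the fragment table
theoretical <- c(a1 = 30.03, b1 = 58.03, z1 = 59.01)

## hypothetical observed masses
observed <- c(30.0301, 58.0310, 100.0000)

## hypothetical helper: name of the closest theoretical mass or NA
matchPpm <- function(observed, theoretical, tolerance) {
    vapply(observed, function(mz) {
        relErr <- abs(mz - theoretical) / theoretical
        i <- which.min(relErr)
        if (relErr[i] <= tolerance) names(theoretical)[i] else NA_character_
    }, character(1))
}

matches <- matchPpm(observed, theoretical, tolerance)
print(matches)
```

Because the tolerance is relative, the absolute window grows with the mass: 5 ppm corresponds to about 0.00015 at mass 30 but to about 0.085 at mass 16900.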
TopDownSet Anatomy
The assembled object is a TopDownSet object. Briefly, it is composed of three interconnected tables:
rowViews/fragment data: holds the information on the type of fragments, their modifications and adducts.
colData/condition data: contains the corresponding fragmentation condition for every spectrum.
assayData: contains the intensity of assigned fragments.
This section explains the implementation details of the TopDownSet class. It is not necessary to understand everything written here to use topdownr for the analysis of fragmentation data.
The TopDownSet contains the following components: Fragment data, Condition data, Assay data.
rowViews(myoglobin)
## FragmentViews on a 153-letter sequence:
## GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPE...SKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
## Mass:
## 16922.95406
## Modifications:
## Met-loss
## Views:
## start end width mass name type z
## [1] 1 1 1 30.03 a1 a 1 [G]
## [2] 1 1 1 58.03 b1 b 1 [G]
## [3] 1 1 1 59.01 z1 z 1 [G]
## [4] 1 1 1 60.02 zpH1 z 1 [G]
## [5] 1 1 1 74.05 cmH1 c 1 [G]
## ... ... ... ... ... ... ... ... ...
## [1212] 2 153 152 16868.93 zpH152 z 1 [LSDGEWQQVLNVWG...DIAAKYKELGFQG]
## [1213] 1 152 152 16882.96 cmH152 c 1 [GLSDGEWQQVLNVW...NDIAAKYKELGFQ]
## [1214] 1 152 152 16883.97 c152 c 1 [GLSDGEWQQVLNVW...NDIAAKYKELGFQ]
## [1215] 2 153 152 16884.95 y152 y 1 [LSDGEWQQVLNVWG...DIAAKYKELGFQG]
## [1216] 2 153 152 16910.93 x152 x 1 [LSDGEWQQVLNVWG...DIAAKYKELGFQG]
The fragment data are represented by a FragmentViews object, which extends the XStringViews class. It contains one AAString (the protein sequence) and an IRanges object that stores the start, end (and width) values of the fragments. Additionally, it has a DataFrame for the mass, type and z information of each fragment.
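The idea of storing fragments as start/end ranges on a single sequence can be sketched in base R without Biostrings or IRanges. This is an illustration only: it uses the first ten residues of the sequence shown above as a toy sequence, labels the N- and C-terminal ranges as b and y for simplicity, and omits masses and charges.

```r
## toy sequence: first ten residues of the protein sequence shown above
sequence <- "GLSDGEWQQV"
n <- nchar(sequence)
k <- 1:3

## N-terminal fragments start at 1, C-terminal fragments end at n
views <- data.frame(
    start = c(rep(1L, length(k)), n - k + 1L),
    end = c(k, rep(n, length(k))),
    type = rep(c("b", "y"), each = length(k)))
views$width <- views$end - views$start + 1L
views$name <- paste0(views$type, views$width)

## the fragment sequences are never stored, just derived from the ranges
views$fragment <- substring(sequence, views$start, views$end)
print(views)
```

Only the ranges and annotations need to be kept in memory; every fragment sequence can be recovered from the single protein sequence on demand.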
conditionData(myoglobin)[, 1:5]
## DataFrame with 5882 rows and 5 columns
## File Scan SpectrumIndex
## <Rle> <numeric> <integer>
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_1 myo_707_ET... 33 22
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_2 myo_707_ET... 34 23
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_3 myo_707_ET... 33 23
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_4 myo_707_ET... 34 24
## C0707.30_1.0e+05_1.0e+06_02.50_14_00_1 myo_707_ET... 36 25
## ... ... ... ...
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_07 myo_1211_E... 223 203
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_08 myo_1211_E... 221 202
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_09 myo_1211_E... 222 203
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_10 myo_1211_E... 223 203
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_11 myo_1211_E... 224 204
## PeaksCount TotIonCurrent
## <integer> <numeric>
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_1 161 27224937
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_2 175 29167765
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_3 180 26132872
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_4 171 25475501
## C0707.30_1.0e+05_1.0e+06_02.50_14_00_1 172 27347105
## ... ... ...
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_07 213 2566120
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_08 250 2348707
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_09 145 2305900
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_10 207 2262800
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_11 158 2212189
Condition data is a DataFrame that contains the combined header information for each MS run. It merges the method table (.experiments.csv files), the scan header table (.txt files) and the metadata from the .mzML files.
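The row names of the condition data encode the fragmentation settings. The sketch below splits one of the names printed above into its fields. The field labels are an assumption: they follow the order given in the processing information (Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation), and the last field is assumed to be a replicate counter. In a real analysis these values are available directly as columns of the condition data, so parsing the names is not needed.

```r
## one condition name as printed in the condition data above
conditionName <- "C0707.30_1.0e+05_1.0e+06_02.50_07_00_1"

## drop the leading "C" and split at the underscores
fields <- strsplit(sub("^C", "", conditionName), "_", fixed = TRUE)[[1]]

## assumed field labels (see the lead-in), not taken from the topdownr API
parsed <- setNames(
    as.numeric(fields),
    c("Mz", "AgcTarget", "EtdReagentTarget", "EtdActivation",
      "CidActivation", "HcdActivation", "Replicate"))
print(parsed)
```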
assayData(myoglobin)[206:215, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_1', 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_2', 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_3' ... ]]
##
## z26 . . . . . . . .
## zpH26 491328.4 446301.1 407389.1 473200.9 470679.3 493244.8 390025.8 389430.25
## y26 . . . . . . . 23648.63
## b27 . . . . . . . .
## cmH27 . . . . . . . .
## c27 . . . . . . . .
## x26 . . . . . . . .
## z27 . . . . . . . .
## zpH27 534307.6 534135.1 434296.8 436866.2 550887.3 513038.8 460476.4 456524.97
## y27 . . . . . . . .
##
## z26 . .
## zpH26 496551.3 554295.7
## y26 . .
## b27 . .
## cmH27 . .
## c27 . .
## x26 . .
## z27 . .
## zpH27 602207.0 579989.8
## y27 . .
Assay data is a sparseMatrix from the Matrix package (more precisely a dgCMatrix) where the rows correspond to the fragments, the columns to the runs/conditions and the entries to the intensity values. A sparseMatrix is similar to the classic matrix in R but stores only the values that are different from zero.
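A small, self-contained example with the Matrix package shows the same structure. The row names and intensities are copied from the output above, while the column names cond1 to cond3 and the placement of the values are made up for illustration.

```r
library("Matrix")

## build a 4 x 3 sparse matrix from (row, column, value) triplets
m <- sparseMatrix(
    i = c(2, 2, 3),
    j = c(1, 2, 2),
    x = c(491328.4, 446301.1, 23648.63),
    dims = c(4, 3),
    dimnames = list(
        c("z26", "zpH26", "y26", "b27"),
        paste0("cond", 1:3)))
print(m)

## number and percentage of stored non-zero entries
nonZero <- nnzero(m)
percentNonZero <- 100 * nonZero / prod(dim(m))
print(percentNonZero)
```

The percentage computed here corresponds to the "% != 0" figure in the TopDownSet summary; with only about 5% non-zero entries in the myoglobin data, the sparse representation saves a considerable amount of memory.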
Subsetting a TopDownSet
A TopDownSet can be subset by the fragment and the condition data.
# select the first 100 fragments
myoglobin[1:100]
## TopDownSet object (3.56 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 100
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;1426.70]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 100x5882 (9.68% != 0)
## Number of matched fragments: 56955
## Intensity range: [105.70;1076768.00]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 56955 fragments [100;5882].
# select all "c" fragments
myoglobin["c"]
## TopDownSet object (4.51 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 304x5882 (7.69% != 0)
## Number of matched fragments: 137461
## Intensity range: [87.61;1203763.75]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 137461 fragments [304;5882].
# select just the 100th "c" fragment
myoglobin["c100"]
## TopDownSet object (2.89 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1
## Theoretical fragment types (1): c
## Theoretical mass range: [11085.96;11085.96]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1x5882 (0.09% != 0)
## Number of matched fragments: 5
## Intensity range: [1276.91;17056.12]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 5 fragments [1;5882].
# select all "a" and "b" fragments but just the first 100 "c"
myoglobin[c("a", "b", paste0("c", 1:100))]
## TopDownSet object (4.59 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 404
## Theoretical fragment types (3): a, b, c
## Theoretical mass range: [30.03;16866.94]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 404x5882 (6.04% != 0)
## Number of matched fragments: 143582
## Intensity range: [87.61;1630533.12]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 143582 fragments [404;5882].
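The fragment counts in the outputs above (304 for "c", 1 for "c100" and 404 for the mixed selection) are consistent with a character index being compared against both the fragment type and the fragment name. The following base R sketch mimics that behaviour on a toy fragment table. The helper selectFragments is hypothetical and only illustrates the selection logic, not the topdownr implementation.

```r
## toy fragment table with names and types as in the rowViews output
fragments <- data.frame(
    name = c("a1", "b1", "c1", "cmH1", "c2", "cmH2", "z1"),
    type = c("a", "b", "c", "c", "c", "c", "z"))

## hypothetical helper: an index entry selects a fragment if it equals
## either the fragment's type or the fragment's name
selectFragments <- function(fragments, i) {
    fragments$type %in% i | fragments$name %in% i
}

## all "c" type fragments (includes the cmH adducts)
allC <- fragments$name[selectFragments(fragments, "c")]
print(allC)

## all "a" and "b" fragments but just the first "c" fragment
mixed <- fragments$name[selectFragments(fragments, c("a", "b", "c1"))]
print(mixed)
```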
# select condition/run 1 to 10
myoglobin[, 1:10]
## TopDownSet object (0.26 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 3
## Number of scans: 10
## Condition variables (66): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1216x10 (8.38% != 0)
## Number of matched fragments: 1019
## Intensity range: [7872.05;1036892.19]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 1019 fragments [1216;10].
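# select all conditions from one file; the string has to equal a value of
# myoglobin$File exactly (in the output below no condition was matched)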
myoglobin[, myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.24 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 0 fragments [1216;0].
# select all "c" fragments from a single file
myoglobin["c", myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.11 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGHGQ...GAMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Processing information - - -
## [2024-10-29 19:56:15] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2024-10-29 19:56:15] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2024-10-29 19:56:15] Recalculate median injection time based on: Mz, AgcTarget.
## [2024-10-29 19:56:16] Subsetted 368296 fragments [1216;5882] to 0 fragments [304;0].
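The two subsets above contain zero conditions ([1216;0] and [304;0]), which suggests that the comparison string did not equal any value in myoglobin$File. A defensive pattern is to check for a match before subsetting. The sketch below uses a hypothetical character vector in place of myoglobin$File. Its values are assumptions modelled on the file names shown in the list.files output, not the actual column content.

```r
## hypothetical stand-in for myoglobin$File (assumed values)
files <- c(
    "myo_1211_ETDReagentTarget_1e6_1",
    "myo_1211_ETDReagentTarget_1e6_1",
    "myo_1211_ETDReagentTarget_1e6_2")

## the string used in the subsetting calls above
wanted <- "myo_1211_ETDReagentTarget_1e+06_1"

## check whether anything would be selected before subsetting
nSelected <- sum(files == wanted)
if (nSelected == 0L) {
    message("No condition matches '", wanted, "'; available values: ",
            paste(unique(files), collapse = ", "))
}

## a string that equals one of the (assumed) values selects two conditions
sel <- files == "myo_1211_ETDReagentTarget_1e6_1"
print(sum(sel))
```

With a real object, unique(myoglobin$File) would list the values that can be used on the right-hand side of the comparison.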
Plotting a TopDownSet
Each condition represents one spectrum. We can plot a single condition interactively or write all spectra into a pdf file (or any other R device that supports multiple pages/plots).
# plot a single condition
plot(myoglobin[, "C0707.30_1.0e+05_1.0e+06_10.00_00_28_3"])
## [[1]]
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the topdownr package.
## Please report the issue at <https://github.com/sgibb/topdownr/issues/>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # example to plot the first ten conditions into a pdf
## # (not evaluated in the vignette)
## pdf("topdown-conditions.pdf", paper="a4r", width=12)
## plot(myoglobin[, 1:10])
## dev.off()
plot returns a list (one item per condition) of ggplot objects, which can be further modified or investigated interactively by calling plotly::ggplotly().
We use the following workflow: