Package: Spectra
Authors: RforMassSpectrometry Package Maintainer [cre],
Laurent Gatto [aut] (https://orcid.org/0000-0002-1520-2268),
Johannes Rainer [aut] (https://orcid.org/0000-0002-6977-7147),
Sebastian Gibb [aut] (https://orcid.org/0000-0001-7406-4443),
Philippine Louail [aut] (https://orcid.org/0009-0007-5429-6846),
Jan Stanstrup [ctb] (https://orcid.org/0000-0003-0541-7369),
Nir Shahaf [ctb],
Mar Garcia-Aloy [ctb] (https://orcid.org/0000-0002-1330-6610)
Last modified: 2024-10-29 15:53:06.073685
Compiled: Tue Oct 29 23:00:15 2024
The Spectra package provides a scalable and flexible
infrastructure to represent, retrieve and handle mass spectrometry (MS)
data. The Spectra
object provides the user with a single standardized
interface to access and manipulate MS data while supporting, through the concept
of exchangeable backends, a large variety of different ways to store and
retrieve mass spectrometry data. Such backends range from mzML/mzXML/CDF files,
simple flat files, or database systems.
This vignette provides general examples and descriptions for the Spectra package. Additional information and tutorials are available, such as SpectraTutorials, MetaboAnnotationTutorials, or also in (Rainer et al. 2022). For information on how to handle and (parallel) process large-scale data sets see the Large-scale data handling and processing with Spectra vignette.
The package can be installed with the BiocManager package. To
install BiocManager use install.packages("BiocManager")
and, after that,
BiocManager::install("Spectra")
to install Spectra.
Mass spectrometry data in Spectra
objects can be thought of as a list of
individual spectra, with each spectrum having a set of variables associated with
it. Besides core spectra variables (such as MS level or retention time)
an arbitrary number of optional variables can be assigned to a spectrum. The
core spectra variables all have their own accessor method and it is
guaranteed that a value is returned by it (or NA
if the information is not
available). The core variables and their data type are (alphabetically
ordered):
integer(1)
: the index of acquisition of a spectrum during a
MS run.logical(1)
: whether the spectrum is in profile or centroid
mode.numeric(1)
: collision energy used to create an MSn
spectrum.character(1)
: the origin of the spectrum’s data, e.g. the
mzML file from which it was read.character(1)
: the (current) storage location of the spectrum
data. This value depends on the backend used to handle and provide the
data. For an in-memory backend like the MsBackendDataFrame
this will be
"<memory>"
, for an on-disk backend such as the MsBackendHdf5Peaks
it will
be the name of the HDF5 file where the spectrum’s peak data is stored.numeric
: intensity values for the spectrum’s peaks.numeric(1)
: lower m/z for the isolation window in
which the (MSn) spectrum was measured.numeric(1)
: the target m/z for the isolation
window in which the (MSn) spectrum was measured.numeric(1)
: upper m/z for the isolation window in
which the (MSn) spectrum was measured.integer(1)
: the MS level of the spectrum.numeric
: the m/z values for the spectrum’s peaks.integer(1)
: the polarity of the spectrum (0
and 1
representing negative and positive polarity, respectively).integer(1)
: the scan (acquisition) number of the precursor for
an MSn spectrum.integer(1)
: the charge of the precursor of an MSn
spectrum.numeric(1)
: the intensity of the precursor of an MSn
spectrum.numeric(1)
: the m/z of the precursor of an MSn spectrum.numeric(1)
: the retention time of a spectrum.integer(1)
: the index of a spectrum within a (raw) file.logical(1)
: whether the spectrum was smoothed.For details on the individual variables and their getter/setter function see the
help for Spectra
(?Spectra
). Also note that these variables are suggested,
but not required to characterize a spectrum. Also, some only make sense for MSn,
but not for MS1 spectra.
Spectra
objectsThe simplest way to create a Spectra
object is by defining a DataFrame
with
the corresponding spectra data (using the corresponding spectra variable names
as column names) and passing that to the Spectra
constructor function. Below
we create such an object for a set of 3 spectra providing their MS level,
olarity but also additional annotations such as their ID in
HMDB (human metabolome database) and their name. The m/z and
intensity values for each spectrum have to be provided as a list
of numeric
values.
library(Spectra)
spd <- DataFrame(
msLevel = c(2L, 2L, 2L),
polarity = c(1L, 1L, 1L),
id = c("HMDB0000001", "HMDB0000001", "HMDB0001847"),
name = c("1-Methylhistidine", "1-Methylhistidine", "Caffeine"))
## Assign m/z and intensity values.
spd$mz <- list(
c(109.2, 124.2, 124.5, 170.16, 170.52),
c(83.1, 96.12, 97.14, 109.14, 124.08, 125.1, 170.16),
c(56.0494, 69.0447, 83.0603, 109.0395, 110.0712,
111.0551, 123.0429, 138.0662, 195.0876))
spd$intensity <- list(
c(3.407, 47.494, 3.094, 100.0, 13.240),
c(6.685, 4.381, 3.022, 16.708, 100.0, 4.565, 40.643),
c(0.459, 2.585, 2.446, 0.508, 8.968, 0.524, 0.974, 100.0, 40.994))
sps <- Spectra(spd)
sps
## MSn data (Spectra) with 3 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## 3 2 NA NA
## ... 18 more variables/columns.
Alternatively, it is possible to import spectra data from mass spectrometry raw
files in mzML/mzXML or CDF format. Below we create a Spectra
object from two
mzML files and define to use a MsBackendMzR
backend to store the data (note
that this requires the mzR package to be installed). This
backend, specifically designed for raw MS data, keeps only a subset of spectra
variables in memory while reading the m/z and intensity values from the original
data files only on demand. See section Backends for more details
on backends and their properties.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
sps_sciex <- Spectra(fls, source = MsBackendMzR())
sps_sciex
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_1_105-134.mzML
## 20171016_POOL_POS_3_105-134.mzML
The Spectra
object sps_sciex
allows now to access spectra data from 1862 MS1
spectra and uses MsBackendMzR
as backend (the Spectra
object sps
created
in the previous code block uses the default MsBackendMemory
).
As detailed above Spectra
objects can contain an arbitrary number of
properties of a spectrum (so called spectra variables). The available
variables can be listed with the spectraVariables()
method:
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "id"
## [19] "name"
spectraVariables(sps_sciex)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "peaksCount"
## [19] "totIonCurrent" "basePeakMZ"
## [21] "basePeakIntensity" "ionisationEnergy"
## [23] "lowMZ" "highMZ"
## [25] "mergedScan" "mergedResultScanNum"
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
## [29] "injectionTime" "filterString"
## [31] "spectrumId" "ionMobilityDriftTime"
## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
The two Spectra
contain a different set of variables: besides "msLevel"
,
"polarity"
, "id"
and "name"
, that were specified for the Spectra
object
sps
, it contains more variables such as "rtime"
, "acquisitionNum"
and
"scanIndex"
. These are part of the core variables defining a spectrum and
for all of these accessor methods exist. Below we use msLevel()
and rtime()
to access the MS levels and retention times for the spectra in sps
.
msLevel(sps)
## [1] 2 2 2
rtime(sps)
## [1] NA NA NA
We did not specify retention times for the spectra in sps
thus NA
is
returned for them. The Spectra
object sps_sciex
contains many more
variables, all of which were extracted from the mzML files. Below we extract the
retention times for the first spectra in the object.
head(rtime(sps_sciex))
## [1] 0.280 0.559 0.838 1.117 1.396 1.675
Note that in addition to the accessor functions it is also possible to use $
to extract a specific spectra variable. To extract the name of the compounds in
sps
we can use sps$name
, or, to extract the MS levels sps$msLevel
.
sps$name
## [1] "1-Methylhistidine" "1-Methylhistidine" "Caffeine"
sps$msLevel
## [1] 2 2 2
We could also replace specific spectra variables using either the dedicated
method or $
. Below we specify that all spectra in sps
represent centroided
data.
sps$centroided <- TRUE
centroided(sps)
## [1] TRUE TRUE TRUE
The $
operator can also be used to add arbitrary new spectra variables to a
Spectra
object. Below we add the SPLASH key to each of the spectra.
sps$splash <- c(
"splash10-00di-0900000000-037d24a7d65676b7e356",
"splash10-00di-0900000000-03e99316bd6c098f5d11",
"splash10-000i-0900000000-9af60e39c843cb715435")
This new spectra variable will now be listed as an additional variable in the
result of the spectraVariables()
function and we can directly access its
content with sps$splash
.
Each spectrum can have a different number of mass peaks, each consisting of a
mass-to-charge (m/z) and associated intensity value. These can be extracted with
the mz()
or intensity()
functions, each of which return a list
of
numeric
values.
mz(sps)
## NumericList of length 3
## [[1]] 109.2 124.2 124.5 170.16 170.52
## [[2]] 83.1 96.12 97.14 109.14 124.08 125.1 170.16
## [[3]] 56.0494 69.0447 83.0603 109.0395 110.0712 111.0551 123.0429 138.0662 195.0876
intensity(sps)
## NumericList of length 3
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994
Peak data can also be extracted with the peaksData()
function that returns a
list of numerical matrices with peak variables such as m/z and intensity
values. Which peak variables are available in a Spectra
object can be
determined with the peaksVariables()
function.
peaksVariables(sps)
## [1] "mz" "intensity"
These can be passed to the peaksData()
function with parameter columns
to
extract the peak variables of interest. By default peaksData()
extracts m/z
and intensity values.
pks <- peaksData(sps)
pks[[1]]
## mz intensity
## [1,] 109.20 3.407
## [2,] 124.20 47.494
## [3,] 124.50 3.094
## [4,] 170.16 100.000
## [5,] 170.52 13.240
Note that we would get the same result by using the as()
method to coerce a
Spectra
object to a list
or SimpleList
:
as(sps, "SimpleList")
## List of length 3
The spectraData()
function returns a DataFrame
with the full data for each
spectrum (except m/z and intensity values), or with selected spectra variables
(which can be specified with the columns
parameter). Below we extract the
spectra data for variables "msLevel"
, "id"
and "name"
.
spectraData(sps, columns = c("msLevel", "id", "name"))
## DataFrame with 3 rows and 3 columns
## msLevel id name
## <integer> <character> <character>
## 1 2 HMDB0000001 1-Methylhistidine
## 2 2 HMDB0000001 1-Methylhistidine
## 3 2 HMDB0001847 Caffeine
Spectra
are one-dimensional objects storing spectra, even from different files
or samples, in a single list. Specific variables have thus to be used to define
the originating file from which they were extracted or the sample in which they
were measured. The data origin of each spectrum can be extracted with the
dataOrigin()
function. For sps
, the Spectra
created from a DataFrame
,
this will be NA
because we did not specify the data origin:
dataOrigin(sps)
## [1] NA NA NA
dataOrigin
for sps_sciex
, the Spectra
which was initialized with data
from mzML files, in contrast, returns the originating file names:
head(basename(dataOrigin(sps_sciex)))
## [1] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
## [3] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
## [5] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
The current data storage location of a spectrum can be retrieved with the
dataStorage
variable, which will return an arbitrary string for Spectra
that
use an in-memory backend or the file where the data is stored for on-disk
backends:
dataStorage(sps)
## [1] "<memory>" "<memory>" "<memory>"
head(basename(dataStorage(sps_sciex)))
## [1] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
## [3] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
## [5] "20171016_POOL_POS_1_105-134.mzML" "20171016_POOL_POS_1_105-134.mzML"
Certain backends (such as the MsBackendMemory
and MsBackendDataFrame
)
support also additional peaks variables. At present, these must already be
present when the backend gets initialized. In future a dedicated function
allowing to add peaks variables will be available. Below we thus first extract
the full data (including peaks variables) from the sps
spectra object and add
a column "peak_anno"
with peak annotations for each individual
peak. Importantly, for peak variables, a value needs to be assigned to each
individual peak, even it it is NA
(the lengths()
of the new peak variable
must match lengths()
of mz
or intensity
, i.e. the number of peaks per
spectrum).
## Extract the full data from a spectrum
spd <- spectraData(sps, columns = union(spectraVariables(sps),
peaksVariables(sps)))
## Add a new column with a *annotation* for each peak
spd$peak_anno <- list(c("a", NA_character_, "b", "c", "d"),
c("a", "b", "c", "d", "e", "f", "g"),
c("a", "b", "c", "d", "e", "f", "g", "h", "i"))
## lengths have to match:
lengths(spd$peak_anno)
## [1] 5 7 9
lengths(spd$mz)
## [1] 5 7 9
The parameter peaksVariables()
(currently only available for the
backendInitialize()
method of MsBackendMemory
and MsBackendDataFrame
)
allows to define which of the columns from an input data contain peaks variables
(in our case "mz"
, "intensity"
and the additional "peak_anno"
column).
sps2 <- Spectra(spd, backend = MsBackendMemory(),
peaksVariables = c("mz", "intensity", "peak_anno"))
peaksVariables(sps2)
## [1] "mz" "intensity" "peak_anno"
Full peak data can be extracted with the peaksData()
function that has a
second parameter columns
allowing to define which peak variables to
return. Below we extract the peak data for the second spectrum.
peaksData(sps2, columns = peaksVariables(sps2))[[2L]]
## mz intensity peak_anno
## 1 83.10 6.685 a
## 2 96.12 4.381 b
## 3 97.14 3.022 c
## 4 109.14 16.708 d
## 5 124.08 100.000 e
## 6 125.10 4.565 f
## 7 170.16 40.643 g
We can also use the peaksData()
function to extract the values for individual
peak variables.
## Peak annotations for the first spectrum
peaksData(sps2, "peak_anno")[[1L]]
## peak_anno
## 1 a
## 2 <NA>
## 3 b
## 4 c
## 5 d
## Peak annotations for the second spectrum
peaksData(sps2, "peak_anno")[[2L]]
## peak_anno
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
## 6 f
## 7 g
Peak variables can also be extracted using the $
method:
sps2$peak_anno
## [[1]]
## [1] "a" NA "b" "c" "d"
##
## [[2]]
## [1] "a" "b" "c" "d" "e" "f" "g"
##
## [[3]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
Similar to spectra variables it is also possible to replace values for
existing peaks variables using the $<-
function.
Various functions are available to filter, subset and merge Spectra
objects. These can be generally subdivided into functions that subset or filter
spectra data and operations that filter mass peak data. A third category of
function allows to aggregate data within a Spectra
or to merge and combine
multiple Spectra
objects into one. Functions of the various categories are
described in the following subsections. Please refer to the function’s
documentation for more details and information.
These functions comprise subset operations that reduce the total number of
spectra in a Spectra
object as well as filter functions that reduce the
content of the Spectra
’s spectra data (i.e. the content of its
spectraVariables()
). These functions thus don’t change or affect the mass
peaks data of the Spectra
’s individual spectra.
[
: operation to reduce a Spectra
object to selected elements.dropNaSpectraVariables()
: drops spectraVariables()
that contain only
missing values. The function returns a Spectra
object with the same number
of elements, but with eventually fewer spectra variables.filterAcquisitionNum()
: retains spectra with certain acquisition numbers.filterDataOrigin()
: subsets to spectra from specific origins.filterDataStorage()
: subsets to spectra from certain data storage files.filterEmptySpectra()
: removes spectra without mass peaks.filterIsolationWindow()
: keeps spectra with the provided mz
in their
isolation window (m/z range).filterMsLevel()
: filters by MS level.filterPolarity()
: filters by polarity.filterPrecursorCharge()
: retains (MSn) spectra with specified
precursor charge(s).filterPrecursorIsotopes()
: identifies precursor ions (from fragment spectra)
that could represent isotopes of the same molecule. For each of these spectra
groups only the spectrum of the monoisotopic precursor ion is returned. MS1
spectra are returned without filtering.filterPrecursorMaxIntensity()
: filters spectra keeping, for groups of
spectra with similar precursor m/z, the one spectrum with the highest
precursor intensity. All MS1 spectra are returned without filtering.filterPrecursorMzRange()
: retains (MSn) spectra with a precursor m/z within
the provided m/z range.filterPrecursorMzValues(()
: retains (MSn) spectra with precursor m/z value
matching the provided value(s) considering also a tolerance
and ppm
.filterPrecursorScan()
: retains (parent and children) scans of an acquisition
number.filterRanges()
: filters a Spectra
object based on (multiple) user
defined numeric ranges for one or more available (numeric) spectra
variables.filterRt()
: filters based on retention time range.filterValues()
: filters a Spectra
object based on similarities of
numeric values of one or more available spectra variables.selectSpectraVariables()
: reduces the (spectra) data within the object to
the selected spectra variables.These function filter or aggregate the mass peak data (peaksData()
) of each
spectrum in a Spectra
without changing the total number of spectra.
combinePeaks()
: groups peaks within each spectrum based on similarity of
their m/z values and combines these into a single peak per peak group.deisotopeSpectra()
: deisotopes each individual spectrum keeping only the
monoisotopic peak for peaks groups of potential isotopologues.filterFourierTransformArtefacts()
: removes (Orbitrap) fast fourier transform
artifact peaks from spectra.filterIntensity()
: filter each spectrum keeping only peaks with intensities
meeting certain criteria.filterMzRange()
: filters mass peaks keeping (or removing) those with an
m/z within the provided m/z range.filterMzValues()
: filters mass peaks within each spectrum keeping (or
removing) those with an m/z matching the provided value(s).filterPeaksRanges()
: filters mass peaks using any set of range-based filters
on numeric spectra or peaks variables.filterPrecursorPeaks()
: removes peaks with either an m/z value matching the
precursor m/z of the respective spectrum (with parameter mz = "=="
) or peaks
with an m/z value larger or equal to the precursor m/z (with parameter
mz = ">="
).reduceSpectra()
: filters individual spectra keeping only the largest peak
for groups of peaks with similar m/z values.c()
: combine several Spectra
into a single Spectra
object.combineSpectra()
: allows to combine the MS data from sets of spectra into a
single spectrum per set. Thus, instead of filtering the data, this function
aggregates it.joinSpectraData()
: merge a DataFrame
to the existing spectra data.split()
: splits the Spectra
object based on a provided grouping factor.In this example, we use the filterValues()
function to retain spectra with a
base peak m/z close to 100 (+/- 30 ppm) and a retention time around 230 (+/- 5
s).
sps_sub <- filterValues(sps_sciex, spectraVariables = c("basePeakMZ", "rtime"),
values = c(123.089, 230), tolerance = c(0,5),
ppm = c(30, 0), match = "all")
length(sps_sub)
## [1] 72
Then, we demonstrate the usage of the filterRanges()
function to filter
spectra based on ranges of values for variables such as base peak m/z, peak
count, and retention time.
sps_ranges <- filterRanges(sps_sciex,
spectraVariables = c("basePeakMZ","peaksCount",
"rtime"),
ranges = c(123.09,124, 3500, 3520, 259, 260),
match = "all")
length(sps_ranges)
## [1] 1
Only one spectrum matches all the ranges. Another option for filterValues()
and filterRanges()
is to use the parameter match = "any"
, which retains
spectra that match any one of the conditions instead of having to match all of
them. Let’s run the code once again but change the match parameter this time:
sps_ranges <- filterRanges(sps_sciex,
spectraVariables = c("basePeakMZ",
"peaksCount", "rtime"),
ranges = c(123.09, 124, 3500, 3520, 259, 260),
match = "any")
length(sps_ranges)
## [1] 473
We can see many more spectra passed the filtering step this time.
In the example below we use specific functions to select all spectra measured in the second mzML file and subsequently filter them to retain spectra measured between 175 and 189 seconds in the measurement run.
fls <- unique(dataOrigin(sps_sciex))
file_2 <- filterDataOrigin(sps_sciex, dataOrigin = fls[2])
length(file_2)
## [1] 931
sps_sub <- filterRt(file_2, rt = c(175, 189))
length(sps_sub)
## [1] 50
In addition, Spectra
support also subsetting with [
. Below we perform the
filtering above with [
-based subsetting.
sps_sciex[sps_sciex$dataOrigin == fls[2] &
sps_sciex$rtime >= 175 &
sps_sciex$rtime <= 189]
## MSn data (Spectra) with 50 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 175.212 628
## 2 1 175.491 629
## 3 1 175.770 630
## 4 1 176.049 631
## 5 1 176.328 632
## ... ... ... ...
## 46 1 187.768 673
## 47 1 188.047 674
## 48 1 188.326 675
## 49 1 188.605 676
## 50 1 188.884 677
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_3_105-134.mzML
The equivalent using filter function is shown below, with the added benefit that the filtering is recorded in the processing slot.
sps_sciex |>
filterDataOrigin(fls[2]) |>
filterRt(c(175, 189))
## MSn data (Spectra) with 50 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 175.212 628
## 2 1 175.491 629
## 3 1 175.770 630
## 4 1 176.049 631
## 5 1 176.328 632
## ... ... ... ...
## 46 1 187.768 673
## 47 1 188.047 674
## 48 1 188.326 675
## 49 1 188.605 676
## 50 1 188.884 677
## ... 33 more variables/columns.
##
## file(s):
## 20171016_POOL_POS_3_105-134.mzML
## Processing:
## Filter: select data origin(s) /home/biocbuild/bbs-3.20-bioc/R/site-library/msdata/sciex/20171016_POOL_POS_3_105-134.mzML [Tue Oct 29 23:00:16 2024]
## Filter: select retention time [175..189] on MS level(s) 1 [Tue Oct 29 23:00:16 2024]
Note that the use of the filter functions might be more efficient for some backends, depending on their implementation, (e.g. database-based backends could translate the filter function into a SQL condition to perform the subsetting already within the database).
Multiple Spectra
objects can also be combined into a single Spectra
with the
c()
or the concatenateSpectra()
function. The resulting Spectra
object
will contain an union of the spectra variables of the individual objects. Below
we combine the Spectra
object sps
with an additional object containing
another MS2 spectrum for Caffeine.
caf_df <- DataFrame(msLevel = 2L, name = "Caffeine",
id = "HMDB0001847",
instrument = "Agilent 1200 RRLC; Agilent 6520 QTOF",
splash = "splash10-0002-0900000000-413259091ba7edc46b87",
centroided = TRUE)
caf_df$mz <- list(c(110.0710, 138.0655, 138.1057, 138.1742, 195.9864))
caf_df$intensity <- list(c(3.837, 32.341, 0.84, 0.534, 100))
caf <- Spectra(caf_df)
Next we combine the two objects.
sps <- concatenateSpectra(sps, caf)
sps
## MSn data (Spectra) with 4 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## 3 2 NA NA
## 4 2 NA NA
## ... 20 more variables/columns.
## Processing:
## Merge 2 Spectra into one [Tue Oct 29 23:00:16 2024]
The resulting object contains now the data for all 4 MS2 spectra and an union of all spectra variables from both objects.
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "id"
## [19] "name" "splash"
## [21] "instrument"
The second object had an additional spectra variable instrument that was not
present in sps
and all the spectra in this object will thus get a value of
NA
for this variable.
sps$instrument
## [1] NA
## [2] NA
## [3] NA
## [4] "Agilent 1200 RRLC; Agilent 6520 QTOF"
Sometimes not all spectra variables might be required (e.g. also because many of
them are empty). This might be specifically interesting also for Spectra
containing the data from very large experiments, because it can significantly
reduce the object’s size in memory. In such cases the selectSpectraVariables()
function can be used to retain only specified spectra variables.
Some analyses require manipulation of the mass peak data (i.e. the m/z and/or
intensity values). One example would be to remove all peaks from a spectrum that
have an intensity lower than a certain threshold. Below we perform such an
operation with the replaceIntensitiesBelow()
function to replace peak
intensities below 10 in each spectrum in sps
with a value of 0.
sps_rep <- replaceIntensitiesBelow(sps, threshold = 10, value = 0)
As a result intensities below 10 were set to 0 for all peaks.
intensity(sps_rep)
## NumericList of length 4
## [[1]] 0 47.494 0 100 13.24
## [[2]] 0 0 0 16.708 100 0 40.643
## [[3]] 0 0 0 0 0 0 0 100 40.994
## [[4]] 0 32.341 0 0 100
Zero-intensity peaks (and peaks with missing intensities) can then be removed
with the filterIntensity()
function specifying a lower required intensity level
or optionally also an upper intensity limit.
sps_rep <- filterIntensity(sps_rep, intensity = c(0.1, Inf))
intensity(sps_rep)
## NumericList of length 4
## [[1]] 47.494 100 13.24
## [[2]] 16.708 100 40.643
## [[3]] 100 40.994
## [[4]] 32.341 100
The filterIntensity()
supports also a user-provided function to be passed with
parameter intensity
which would allow e.g. to remove peaks smaller than the
median peak intensity of a spectrum. See examples in the ?filterIntensity
help
page for details.
Note that any data manipulations on Spectra
objects are not immediately
applied to the peak data. They are added to a so called processing queue which
is applied each time peak data is accessed (with the peaksData()
, mz()
or
intensity()
functions). Thanks to this processing queue data manipulation
operations are also possible for read-only backends (e.g. mzML-file based
backends or database-based backends). The information about the number of such
processing steps can be seen below (next to Lazy evaluation queue).
sps_rep
## MSn data (Spectra) with 4 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## 3 2 NA NA
## 4 2 NA NA
## ... 20 more variables/columns.
## Lazy evaluation queue: 2 processing step(s)
## Processing:
## Merge 2 Spectra into one [Tue Oct 29 23:00:16 2024]
## Signal <= 10 in MS level(s) 2 set to 0 [Tue Oct 29 23:00:16 2024]
## Remove peaks with intensities outside [0.1, Inf] in spectra of MS level(s) 2. [Tue Oct 29 23:00:16 2024]
It is possible to add also custom functions to the processing queue of a
Spectra
object. Such a function must take a peaks matrix as its first
argument, have ...
in the function definition and must return a peaks matrix
(a peaks matrix is a numeric two-column matrix with the first column containing
the peaks’ m/z values and the second the corresponding intensities). Below we
define a function that divides the intensities of each peak by a value which can
be passed with argument y
.
## Define a function that takes a matrix as input, divides the second
## column by parameter y and returns it. Note that ... is required in
## the function's definition.
divide_intensities <- function(x, y, ...) {
x[, 2] <- x[, 2] / y
x
}
## Add the function to the procesing queue
sps_2 <- addProcessing(sps_rep, divide_intensities, y = 2)
sps_2
## MSn data (Spectra) with 4 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## 3 2 NA NA
## 4 2 NA NA
## ... 20 more variables/columns.
## Lazy evaluation queue: 3 processing step(s)
## Processing:
## Merge 2 Spectra into one [Tue Oct 29 23:00:16 2024]
## Signal <= 10 in MS level(s) 2 set to 0 [Tue Oct 29 23:00:16 2024]
## Remove peaks with intensities outside [0.1, Inf] in spectra of MS level(s) 2. [Tue Oct 29 23:00:16 2024]
Object sps_2
has now 3 processing steps in its lazy evaluation queue. Calling
intensity()
on this object will now return intensities that are half of the
intensities of the original objects sps
.
intensity(sps_2)
## NumericList of length 4
## [[1]] 23.747 50 6.62
## [[2]] 8.354 50 20.3215
## [[3]] 50 20.497
## [[4]] 16.1705 50
intensity(sps_rep)
## NumericList of length 4
## [[1]] 47.494 100 13.24
## [[2]] 16.708 100 40.643
## [[3]] 100 40.994
## [[4]] 32.341 100
Alternatively we could define a function that returns the maximum peak from each
spectrum (note: we use the unname()
function to remove any names from the
results):
max_peak <- function(x, ...) {
unname(x[which.max(x[, 2]), , drop = FALSE])
}
sps_2 <- addProcessing(sps_rep, max_peak)
lengths(sps_2)
## [1] 1 1 1 1
intensity(sps_2)
## NumericList of length 4
## [[1]] 100
## [[2]] 100
## [[3]] 100
## [[4]] 100
Each spectrum in sps_2
thus contains only a single peak. The parameter
spectraVariables
of the addProcessing()
function allows in addition to
define spectra variables that should be passed (in addition to the peaks matrix)
to the user-provided function. This would enable for example to calculate
neutral loss spectra from a Spectra
by subtracting the precursor m/z from
each m/z of a spectrum (note that there would also be a dedicated
neutralLoss()
function to perform this operation more efficiently). Our tool
example does not have precursor m/z values defined, thus we first set them to
arbitrary values. Then we define a function neutral_loss
that calculates the
difference between the precursor m/z and the fragment peak’s m/z. In addition we
need to ensure the peaks in the resulting spectra are ordered by (the delta) m/z
values. Note that, in order to be able to access the precursor m/z of the
spectrum within our function, we have to add a parameter to the function that
has the same name as the spectrum variable we want to access (in our case
precursorMz
).
sps_rep$precursorMz <- c(170.5, 170.5, 195.1, 195.1)
neutral_loss <- function(x, precursorMz, ...) {
x[, "mz"] <- precursorMz - x[, "mz"]
x[order(x[, "mz"]), , drop = FALSE]
}
We have then to call addProcessing()
with spectraVariables = "precursorMz"
to specify that this spectra variable is passed along to our function.
sps_3 <- addProcessing(sps_rep, neutral_loss,
spectraVariables = "precursorMz")
mz(sps_rep)
## NumericList of length 4
## [[1]] 124.2 170.16 170.52
## [[2]] 109.14 124.08 170.16
## [[3]] 138.0662 195.0876
## [[4]] 138.0655 195.9864
mz(sps_3)
## NumericList of length 4
## [[1]] -0.0200000000000102 0.340000000000003 46.3
## [[2]] 0.340000000000003 46.42 61.36
## [[3]] 0.0123999999999853 57.0338
## [[4]] -0.886400000000009 57.0345
As we can see, the precursor m/z was subtracted from each m/z of the respective
spectrum. A better version of the function, that only calculates neutral loss
spectra for MS level 2 spectra would be the neutral_loss
function below. Since
we are accessing also the spectrum’s MS level we have to call addProcessing()
adding also the spectra variable msLevel
to the spectraVariables
parameter. Note however that the msLevel
spectra variable is by default
renamed to spectrumMsLevel
prior passing it to the function. We have thus to
use a parameter called spectrumMsLevel
in the neutral_loss
function instead
of msLevel
.
neutral_loss <- function(x, spectrumMsLevel, precursorMz, ...) {
if (spectrumMsLevel == 2L) {
x[, "mz"] <- precursorMz - x[, "mz"]
x <- x[order(x[, "mz"]), , drop = FALSE]
}
x
}
sps_3 <- addProcessing(sps_rep, neutral_loss,
spectraVariables = c("msLevel", "precursorMz"))
mz(sps_3)
## NumericList of length 4
## [[1]] -0.0200000000000102 0.340000000000003 46.3
## [[2]] 0.340000000000003 46.42 61.36
## [[3]] 0.0123999999999853 57.0338
## [[4]] -0.886400000000009 57.0345
Using the same concept it would also be possible to provide any
spectrum-specific user-defined value to the processing function. This variable
could simply be added first as a new spectra variable to the Spectra
object
and then this variable could be passed along to the function in the same way we
passed the precursor m/z to our function above.
Another example for spectra processing potentially helpful for spectral matching
against reference fragment spectra libraries would be a function that removes
fragment peaks with an m/z matching the precursor m/z of a spectrum. Below we
define such a function that takes the peaks matrix and the precursor m/z as
input and evaluates with the closest()
function from the
MsCoreUtils whether the spectrum contains peaks with an m/z value
matching the one of the precursor (given tolerance
and ppm
). The returned
peaks matrix contains all peaks except those matching the precursor m/z.
library(MsCoreUtils)
remove_precursor <- function(x, precursorMz, tolerance = 0.1, ppm = 0, ...) {
if (!is.na(precursorMz)) {
keep <- is.na(closest(x[, "mz"], precursorMz, tolerance = tolerance,
ppm = ppm, .check = FALSE))
x[keep, , drop = FALSE]
} else x
}
We can now again add this processing step to our Spectra
object. As a result,
peaks matching the precursor m/z (with tolerance = 0.1
and ppm = 0
) will be
removed.
sps_4 <- addProcessing(sps_rep, remove_precursor,
spectraVariables = "precursorMz")
peaksData(sps_4) |> as.list()
## [[1]]
## mz intensity
## [1,] 124.20 47.494
## [2,] 170.16 100.000
##
## [[2]]
## mz intensity
## [1,] 109.14 16.708
## [2,] 124.08 100.000
## [3,] 170.16 40.643
##
## [[3]]
## mz intensity
## [1,] 138.0662 100
##
## [[4]]
## mz intensity
## [1,] 138.0655 32.341
## [2,] 195.9864 100.000
As a reference, the original peak matrices are shown below.
peaksData(sps_rep) |> as.list()
## [[1]]
## mz intensity
## [1,] 124.20 47.494
## [2,] 170.16 100.000
## [3,] 170.52 13.240
##
## [[2]]
## mz intensity
## [1,] 109.14 16.708
## [2,] 124.08 100.000
## [3,] 170.16 40.643
##
## [[3]]
## mz intensity
## [1,] 138.0662 100.000
## [2,] 195.0876 40.994
##
## [[4]]
## mz intensity
## [1,] 138.0655 32.341
## [2,] 195.9864 100.000
Note that we can also perform a more relaxed matching of m/z values by passing a
different value for tolerance
to the function:
sps_4 <- addProcessing(sps_rep, remove_precursor, tolerance = 0.6,
spectraVariables = "precursorMz")
peaksData(sps_4) |> as.list()
## [[1]]
## mz intensity
## [1,] 124.2 47.494
##
## [[2]]
## mz intensity
## [1,] 109.14 16.708
## [2,] 124.08 100.000
##
## [[3]]
## mz intensity
## [1,] 138.0662 100
##
## [[4]]
## mz intensity
## [1,] 138.0655 32.341
## [2,] 195.9864 100.000
Since all data manipulations above did not change the original intensity or m/z
values, it is possible to restore the original data. This can be done with the
reset()
function which will empty the lazy evaluation queue and call the
reset()
method on the storage backend. Below we call reset()
on the sps_2
object and hence restore the data to its original state.
sps_2_rest <- reset(sps_2)
intensity(sps_2_rest)
## NumericList of length 4
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994
## [[4]] 3.837 32.341 0.84 0.534 100
intensity(sps)
## NumericList of length 4
## [[1]] 3.407 47.494 3.094 100 13.24
## [[2]] 6.685 4.381 3.022 16.708 100 4.565 40.643
## [[3]] 0.459 2.585 2.446 0.508 8.968 0.524 0.974 100 40.994
## [[4]] 3.837 32.341 0.84 0.534 100
Finally, for Spectra
that use a writeable backend, such as the
MsBackendMemory
, MsBackendDataFrame
or MsBackendHdf5Peaks
, it is possible
to apply the processing queue to the peak data and write that back to the data
storage with the applyProcessing()
function. Below we use this to make all
data manipulations on peak data of the sps_rep
object persistent.
length(sps_rep@processingQueue)
## [1] 2
sps_rep <- applyProcessing(sps_rep)
length(sps_rep@processingQueue)
## [1] 0
sps_rep
## MSn data (Spectra) with 4 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## 3 2 NA NA
## 4 2 NA NA
## ... 20 more variables/columns.
## Processing:
## Merge 2 Spectra into one [Tue Oct 29 23:00:16 2024]
## Signal <= 10 in MS level(s) 2 set to 0 [Tue Oct 29 23:00:16 2024]
## Remove peaks with intensities outside [0.1, Inf] in spectra of MS level(s) 2. [Tue Oct 29 23:00:16 2024]
## ...1 more processings. Use 'processingLog' to list all.
Before applyProcessing()
the lazy evaluation queue contained 2 processing
steps, which were then applied to the peak data and written to the data
storage. Note that calling reset()
after applyProcessing()
can no longer
restore the data.
Spectra
The Spectra
package provides the following functions to visualize spectra
data:
- plotSpectra()
: plot each spectrum in Spectra
in its own panel.
- plotSpectraOverlay()
: plot multiple spectra into the same plot.
Below we use plotSpectra()
to plot the 4 spectra from the sps
object using
their names (as provided in spectra variable "name"
) as plot titles.
plotSpectra(sps, main = sps$name)