This tutorial shows how to use the CluMSID
package to help annotate MS2 spectra from untargeted LC-MS/MS data. CluMSID
works with MS2 data generated by data-dependent acquisition and requires an mzXML file (like in this example) or any other file that can be parsed by mzR
, like mzML, mzTab or netCDF, as input. It can be used both stand-alone and together with the XCMS suite of preprocessing tools.
CluMSID
extracts and merges MS2 spectra and generates neutral loss patterns for each feature. Additionally, it can make use of information from the CAMERA
package to generate pseudospectra from MS1 level data. The tool uses cosine similarity to generate distance matrices from MS2 spectra, neutral loss patterns and pseudospectra.
These distance matrices are the basis for multivariate statistics methods such as multidimensional scaling, density-based clustering, hierarchical clustering and correlation networks. The CluMSID
package provides functions for these methods including (interactive) visualisation but the distance/similarity data can also be analysed with other R
functions.
For the demonstrations in this tutorial, we will mainly use data from pooled Pseudomonas aeruginosa cell extracts, measured in ESI-(+) mode with auto-MS/MS on a Bruker maxisHD qTOF after reversed phase separation by UPLC. For details, please refer to the Depke et al. 2017 publication (doi: 10.1016/j.jchromb.2017.06.002.).
To be able to access the example data, we also need the related package CluMSIDdata
. The packages can be loaded as follows:
MS2spectrum
and pseudospectrum
classesCluMSID
uses a custom S4 class named MS2spectrum
to store spectral information in the following slots:
id
: a character string similar to the ID used by XCMSonline or the ID given in a predefined peak listannotation
: a character string containing a user-defined annotation, defaults to emptyprecursor
: (median) m/z of the spectrum’s precursor ionrt
: (median) retention time of the spectrum’s precursor ionpolarity
: the polarity with which the spectrum was recorded, either positive
or negative
spectrum
: the actual MS2 spectrum as two-column matrix (column 1 is (median) m/z, column 2 is (median) intensity of the product ions)neutral_losses
: a neutral loss pattern generated by subtracting the product ion mass-to-charge ratios from the precursor m/z in a matrix format analogous to the spectrum
slotThe pseudospectrum
class is very similar but it contains no information on precursor m/z and therefore no neutral loss pattern, either. By default, the id
slot contains the “pcgroup” number assigned by CAMERA
.
The individual slots of MS2spectrum
and pseudospectrum
objects can be accessed via the standard S4 way using object@slot
, e.g. object@annotation
or by using an accessor function. These exist for all slots and are called accessFoo()
, where Foo
is the slot name (not exactly, though, because Bioconductor
does not allow to mix snake_case
and camelCase
in function names):
accessID(object)
accessAnnotation(object)
accessPrecursor(object)
accessRT(object)
accessPolarity(object)
accessSpectrum(object)
accessNeutralLosses(object)
.The first step in the CluMSID
workflow is to extract MS2 spectra from the raw data file (in mzXML format). This is done by the extractMS2spectra
function which internally uses several functions from the mzR
package. The function offers the possibility to filter spectra that contain less a defined number of peaks and/or do not fall in a defined retention time window. Setting the recalibrate_precursor
argument to TRUE
activates a correction process for uncalibrated precursor m/z data that existed in older version of Bruker’s Compass Xport (cf. Depke et al. 2017). It is not necessary to use it with files generated by other software but does not corrupt the data, either.
Please be aware that mzR
often throws warnings concerning the Rcpp
version that can usually be ignored.
ms2list <- extractMS2spectra(system.file("extdata",
"PoolA_R_SE.mzXML",
package = "CluMSIDdata"),
min_peaks = 2,
recalibrate_precursor = TRUE,
RTlims = c(0,25))
This operation has now extracted all the MS2 spectra from the raw data file and stored them in a list. Each list entry is an object of class MS2spectrum
. The list is quite long because it still contains a lot of spectra that derive from the same chromatographic peak.
In our example, the first two spectra in the list derive from the same peak and thus have the same precursor ion and almost the same retention time.
head(ms2list, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 146.1652
#> retention time: 56.266
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 146.1653
#> retention time: 57.292
#> polarity: positive
#> MS2 spectrum with 3 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 129.1387
#> retention time: 57.545
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id:
#> annotation:
#> precursor: 112.1119
#> retention time: 57.797
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 0 neutral losses
From the output above, you also see that the MS2spectrum
class has a show()
generic that summarises the MS2 spectrum and neutral loss pattern data. To show the default output, use showDefault()
. Be aware that neutral loss patterns have not been calculated in this step.
showDefault(ms2list[[2]])
#> An object of class "MS2spectrum"
#> Slot "id":
#> character(0)
#>
#> Slot "annotation":
#> character(0)
#>
#> Slot "precursor":
#> [1] 146.1653
#>
#> Slot "rt":
#> [1] 57.292
#>
#> Slot "polarity":
#> [1] "positive"
#>
#> Slot "spectrum":
#> mz intensity
#> [1,] 72.08064 2448
#> [2,] 84.08077 328
#> [3,] 112.11228 843
#>
#> Slot "neutral_losses":
#> <0 x 0 matrix>
To reduce the amount of redundant MS2 spectra, the mergeMS2spectra()
function is used to generate consensus spectra from the MS2 spectra that derive from the same precursor. CluMSID
offers two possibilities to do so:
This possibility is the standard method for stand-alone use of CluMSID
and is equivalent to what has been described in Depke et al. 2017. It does not need additional input and summarises consecutive spectra that have the same precursor m/z if their retention time fall within a defined threshold (rt_tolerance
, defaults to 30s). A retention time difference between consecutive spectra larger than rt_tolerance
is interpreted as chromatographic separation and respective spectra will be assigned to a new feature. The mz_tolerance
argument should be set according to your instruments m/z precision, the default is 1 * 10-5 (10ppm, equivalent to ±5ppm instrument precision). The peaktable
and exclude_unmatched
arguments are not used in this method and are to be left at their default.
head(featlist, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M146.17T59.35
#> annotation:
#> precursor: 146.1653
#> retention time: 59.35
#> polarity: positive
#> MS2 spectrum with 8 fragment peaks
#> neutral loss pattern with 7 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: M129.14T58.57
#> annotation:
#> precursor: 129.1387
#> retention time: 58.57
#> polarity: positive
#> MS2 spectrum with 4 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M112.11T57.8
#> annotation:
#> precursor: 112.1119
#> retention time: 57.8
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 1 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M251.16T60.64
#> annotation:
#> precursor: 251.1603
#> retention time: 60.64
#> polarity: positive
#> MS2 spectrum with 9 fragment peaks
#> neutral loss pattern with 8 neutral losses
The total amount of spectra was reduced from 2290 to 518 and as many other, the redundant spectra #1 and #2 in the raw list are now merged to one consensus spectrum (#1 in the merged list).
In this step, neutral loss patterns have been generated that look like this:
The second possibility is to supply a peaktable, i.e. a list of picked peaks with their mass-to-charge ratios and retention times. This is particularly useful if you want to annotate a complete metabolomics data set. In our example, we have a metabolomics dataset called “TD035” in which we have measured a range of samples in MS1 mode for relative quantification. Additionally, we have measured a pooled QC sample in MS2 mode for annotation. The MS1 data were analysed using XCMSonline and we want to group the MS2 spectra so that they match the XCMSonline peak picking.
The spectra are extracted as shown above:
ms2list2 <- extractMS2spectra(system.file("extdata",
"TD035-PoolMSMS2.mzXML",
package = "CluMSIDdata"),
min_peaks = 2,
recalibrate_precursor = TRUE,
RTlims = c(0,25))
The peaklist is imported from the XCMSonline output. The list has to contain at least 3 columns:
Shown below is an easy way of getting from an XCMSonline annotated diffreport to a suitable peaktable using tidyverse functions. Of course, you can achieve the same goal with base R functions or even in Excel. Depending on the retention time format in your *.mzXML file, you might have to convert from minutes to seconds or vice versa. Here, we have minutes in the XCMSonline output but seconds in the MS2 file, so we multiply by 60.
require(magrittr)
ptable <-
readr::read_delim(file = system.file("extdata",
"TD035_XCMS.annotated.diffreport.tsv",
package = "CluMSIDdata"),
delim = "\t") %>%
dplyr::select(c(name, mzmed, rtmed)) %>%
dplyr::mutate(rtmed = rtmed * 60)
head(ptable)
#> # A tibble: 6 × 3
#> name mzmed rtmed
#> <chr> <dbl> <dbl>
#> 1 M245T2 245. 100.
#> 2 M440T2_1 440. 107.
#> 3 M578T2 578. 104.
#> 4 M85T1 85.0 60.8
#> 5 M126T1_1 126. 61.0
#> 6 M688T24 688. 1468.
We can now use this peaktable as an argument for mergeMS2spectra()
. You can choose whether you want to keep or exclude MS2 spectra that do not match any peak in the peaktable. These can occur in regions of the chromatogramm where there are no clear peaks but the auto-MS/MS still fragments the most abundant ions. These unmatched spectra are merged following the same rules as described above (method without peaktable). In this example, we keep the unmatched spectra. We use the default values for m/z and retention time tolerance and thus do not need to specify them.
featlist2 <- mergeMS2spectra(ms2list2,
peaktable = ptable,
exclude_unmatched = FALSE)
head(featlist2, 4)
#> [[1]]
#> An object of class "MS2spectrum"
#> id: M213T0
#> annotation:
#> precursor: 213.1462
#> retention time: 6.04
#> polarity: positive
#> MS2 spectrum with 5 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[2]]
#> An object of class "MS2spectrum"
#> id: xM158T31.17
#> annotation:
#> precursor: 158.0027
#> retention time: 31.17
#> polarity: positive
#> MS2 spectrum with 3 fragment peaks
#> neutral loss pattern with 3 neutral losses
#> [[3]]
#> An object of class "MS2spectrum"
#> id: M146T1_3
#> annotation:
#> precursor: 146.1650
#> retention time: 61.15
#> polarity: positive
#> MS2 spectrum with 7 fragment peaks
#> neutral loss pattern with 6 neutral losses
#> [[4]]
#> An object of class "MS2spectrum"
#> id: M129T1_4
#> annotation:
#> precursor: 129.1384
#> retention time: 60.74
#> polarity: positive
#> MS2 spectrum with 2 fragment peaks
#> neutral loss pattern with 2 neutral losses
Note that the 2nd entry in featlist2
is marked with an ‘x’ which means that it could not be assigned to a feature in the peaktable.
For the sake of simplicity, only the data generated from the stand-alone procedure will be used for the following examples. Be assured that all of them would also work with the data generated with the help of an external peaktable (featlist2
).
The next step is to add (external) annotations to the list of features, e.g. from a spectral library that you curate in-house or one that has been supplied by your instrument manufacturer. If you do not (want to) annotate your features at all, this step can be skipped completely, leaving the annotation
slot of the MS2spectrum
objects empty.
CluMSID
offers several possibilities to add annotations to your feature list. The most basic one first generates a list of features and saves it as *.csv file. For that you use the writeFeaturelist()
function and only have to specify your list of spectra and a file name for the output file (here: pre_anno.csv
). You can then manually fill in your annotations in a new column in the table, save it (in this example under the name post_anno.csv
) and reload it to R
:
annotatedSpeclist <- addAnnotations(featlist, system.file("extdata",
"post_anno.csv",
package = "CluMSIDdata"))
annotatedSpeclist
will then be equivalent to featlist
with annotations added to the annotation
slot of the list entries.
You can add annotations without leaving the R
environment, too. addAnnotations()
also accepts objects of class data.frame
as annolist
argument. Be aware that addAnnotations()
assigns the annotation based on the position in the feature list. I.e., if the order of the features in your list of features (featlist
) and your list of annotations (annolist
) is different, you will get nonsense results.
The savest ways to addAnnotations()
with a data.frame
is to use featureList()
to generate a data.frame
that is formatted in the same way as the file output from writeFeaturelist()
and then match your identifications against this data.frame
and use the result as argument for addAnnotations()
.
Say you have an object called annos
that contains feature IDs (the same as in featlist
) and annotations in a two-column data.frame
with “id” and “annotation” as column names. It could look like this:
str(annos)
#> 'data.frame': 154 obs. of 2 variables:
#> $ id : chr "M146.17T59.35" "M129.14T58.57" "M112.11T57.8" "M148.06T69.65" ...
#> $ annotation: chr "spermidine" "spermidine (fragment)" "spermidine (fragment)" "glutamate" ...
head(annos)
#> id annotation
#> 1 M146.17T59.35 spermidine
#> 2 M129.14T58.57 spermidine (fragment)
#> 3 M112.11T57.8 spermidine (fragment)
#> 4 M148.06T69.65 glutamate
#> 5 M130.05T69.64 glutamate (fragment)
#> 6 M179.06T71.32 gluconolactone
addAnnotations(featlist, annos, annotationColumn = 2)
will throw an error because featlist
and annos
are of different length. Instead, you need to do the following:
Now, you can annotate your list of spectra using addAnnotations(featlist, fl_annos, annotationColumn = 4)
.
An analogous procedure works if you have your annotations stored in a peaktable that you have used for mergeMSspectra()
. As the order of spectra in the list will not be same as the order of features in your peaktable, you need to do a matching with the output of featureList()
as well.
Once we have a list of MS2spectrum
objects containing all the required information with or without annotation, we can generate distance matrices from (product ion) MS2 spectra as well as from neutral loss patterns. These distance matrices serve as the basis for further analysis of the data. Both for MS2 spectra and neutral loss patterns, cosine similarity is used as similarity metric:
\[ cos(\theta) = \frac{\sum_{i}a_i \cdot b_i}{\sqrt{\sum_{i}{a_{i}}^2 \cdot \sum_{i}{b_{i}}^2}} \]
For most applications, analysing the similarity of product ion MS2 spectra will be most useful. The generation of the distance matrix is done by just one simple command but it can take some time to calculate.
Common neutral losses and neutral loss patterns can convey information about structural similarity, as well, e.g. with nucleotides or glykosylated secondary metabolites. CluMSID
offers the possibility to study neutral loss patterns independently from product ion spectra. The generation of a distance matrix is analogous, you just need to set the ‘type’ argument to “neutral_losses”:
One rather simple possibility to visually analyse the spectral similarity data is multidimensional scaling, a dimension reduction method that simplifies distances in n-dimensional space to those in two-dimensional space (n in this case being the number of consensus spectra or neutral loss patterns that were used to generate the distance matrix in the previous step). CluMSID
offers a simple function to produce an MDS plot from the distance matrix with the option to highlight annotated metabolites and the possibility to generate an interactive plot using plotly
.
Standard MDS plots are generated as follows:
For MS2 spectra:
For neutral loss patterns:
Interactive plots are zoomable and show feature names upon mouse-over. They are generated like normal MDS plots and can be viewed within RStudio or—after saving as html file using htmlwidgets
—displayed in a normal web browser.
my_mds <- MDSplot(distmat, interactive = TRUE, highlight_annotated = TRUE)
htmlwidgets::saveWidget(my_mds, "mds.html")
This is how it looks like if you open the html file in Firefox and mouse over a feature:
For density-based clustering with CluMSID
, the ‘OPTICS’ algorithm and its implementation in the dbscan
package is used. Density-based clustering is a useful clustering method that often yields different results than hierarchical clustering and can thus provide additional insight into the data. CluMSID
has two functions to perform density-based clustering, one for the reachability plot which is the most useful visualisation of OPTICS results and one that outputs a data.frame
containing the cluster assignations for every feature.
Both functions require as arguments a distance matrix as well as three parameters for the underlying functions dbscan::optics
and dbscan::extractDBSCAN
: eps
, minPts
and eps_cl
. Lowering the eps
parameter (default is 10000) limits the size of the epsilon neighbourhood which from experience has very little effect on the results. minPts
defaults to 3 in CluMSID
. It defines how many points are considered for reachability distance calculation between clusters. The dbscan::optics
default for minPts
is 5. Users are encourage to experiment with this parameter. eps_cl
is the reachability threshold to identify clusters and can be varied based on your data. Lowering eps_cl
leads to a larger number of smaller clusters and vice versa for raising the value. In general, it is advisable to chose a higher eps_cl
for MS2 spectra than for neutral loss patterns, since the latter tend to show less similarity to each other. For details, please refer to the dbscan
help for the dbscan::optics
and dbscan::extractDBSCAN
functions.
If the default parameters are used, the generation of an OPTICS reachability plots is very simple, shown here for MS2 spectra and neutral loss patterns:
In the reachability plots, every line represents a feature and the height of the line is the reachability distance to the next feature in the OPTICS order. Thus, valleys represent groups of similar spectra or neutral loss patterns. The order and the cluster assignment can be studied using the OPTICStbl
function that outputs a three-column data.frame
with feature id, cluster assignment and OPTICS order. The order of features in the data.frame
corresponds to the original order in the input distance matrix. Features that were not assigned to a cluster are black in the reachability plot and have the cluster ID 0. OPTICStbl
takes the same arguments as OPTICSplot
. The two functions have to be run with exactly the same parameters to assure compatibility of results.
In Depke et al. 2017, hierarchical clustering proved the most useful method to unveil structural similarities between features. analogous to density-based clustering, CluMSID
offers two functions, one for plots and one for a data.frame
with cluster assignments, both taking a distance matrix as the only compulsory argument. The other two parameters are h
(defaults to 0.95
), the height where the tree should be cut (see stats::cutree
for details) and type
that determines the type visualisation:
heatmap
: a heatmap displaying pairwise similarities/distances along with cluster dendrogramsdendrogram
(default): a circular dendrogram with colour code for cluster assignmentHeatmaps of our example data for MS2 and neutral loss pattern similarity are created as follows (with reduced label font size by changing cexRow
and cexCol
as well as margins
of the underlying heatmap.2
function):
Obviously, it makes sense to export the plots to larger pdf or png files (e.g. 2000 \(\times\) 2000 pixels) to examine them closely. If exported to pdf, the feature names remain searchable (Ctrl+F
in Windows).