Using CluMSID with a Publicly Available MetaboLights Data Set

Tobias Depke

May 01, 2024


In this tutorial, we would like to demonstate the use of CluMSID with a publicly available LC-MS/MS data set deposited on MetaboLights. We chose data set MTBLS433 that can be accessed on the MetaboLights web page ( and which has been published in the following article:

Kalogiouri, N. P., Alygizakis, N. A., Aalizadeh, R., & Thomaidis, N. S. (2016). Olive oil authenticity studies by target and nontarget LC–QTOF-MS combined with advanced chemometric techniques. Analytical and bioanalytical chemistry, 408(28), 7955-7970.

The authors analysed olive oil of various providence using reversed-phase ultra high performance liquid chromatography–electrospray ionisation quadrupole time of flight tandem mass spectrometry in negative mode with auto-MS/MS fragmentation.

As a representative pooled sample is not provided, we will combine MS2 data from several runs and use the peak picking done by the authors of the study for the merging of MS2 spectra. Some metabolite annotations are also included in the MTBLS433 data set which we will integrate into our analysis.


Extract MS2 spectra from multiple *.mzML files

For demonstration, not all files from the analysis will be included into the analysis. Four data files from the data set have been chosen that represent olive oil samples from different regions in Greece:

Note that these are mzML files that can be processed the exact same way as mzXML files.

Furthermore, we would like to use the peak picking and annotation data from the original authors which we can read from the file m_mtbls433_metabolite_profiling_mass_spectrometry_v2_maf.tsv.

First, we extract MS2 spectra from the respective files separately by using extractMS2spectra(). Then, we just combine the resulting lists into one list using base R functionality:

YH1 <- system.file("extdata", "YH1_GA7_01_10463.mzML",
                    package = "CluMSIDdata")
AX1 <- system.file("extdata", "AX1_GB5_01_10470.mzML",
                    package = "CluMSIDdata")
LP1 <- system.file("extdata", "LP1_GB3_01_10467.mzML",
                    package = "CluMSIDdata")
BR1 <- system.file("extdata", "BR1_GB6_01_10471.mzML",
                    package = "CluMSIDdata")

YH1list <- extractMS2spectra(YH1)
AX1list <- extractMS2spectra(AX1)
LP1list <- extractMS2spectra(LP1)
BR1list <- extractMS2spectra(BR1)

raw_oillist <- c(YH1list, AX1list, LP1list, BR1list)

Merge spectra with external peak list

First, we import the peak list by reading the respective table and filtering for the relevant information. We only need the columns metabolite_identification, mass_to_charge and rentention_time and we would like to replace "unknown" with an empty field in the metabolite_identification column. Plus, the features do not have a unique identifier in the table but we can easily generate that from m/z and RT. Note that the retention time in the raw data is given in seconds and in the data table it is in minutes, so we have to convert. For the sake of consistency, we also change the column names. We use tidyverse syntax but users can do as they prefer.

raw_mtbls_df <- system.file("extdata", 
                package = "CluMSIDdata")


mtbls_df <- readr::read_delim(raw_mtbls_df, "\t") %>%
    dplyr::mutate(metabolite_identification = 
                                    "unknown", "")) %>%
    dplyr::mutate(id = paste0("M", mass_to_charge, "T", retention_time)) %>%
    dplyr::mutate(retention_time = retention_time * 60) %>%
            metabolite_identification) %>%
    dplyr::rename(mz = mass_to_charge,
            rt = retention_time,
            annotation = metabolite_identification)

This peak list, or its first three columns, can now be used to merge spectra. We exclude spectra that do not match to any of the peaks in the peak list. As we are not very familiar with instrumental setup, we set the limits for retention time and m/z deviation a little wider. To make an educated guess on mass accuracy, we take a look at an identified metabolite, its measured m/z and its theoretical m/z. We use arachidic acid [M-H]-, whose theoretical m/z is 311.2956:

## Define theoretical m/z
th <- 311.2956

## Get measured m/z for arachidic acid data from mtbls_df
ac <- mtbls_df %>%
    dplyr::filter(annotation == "Arachidic acid") %>%
    dplyr::select(mz) %>%

## Calculate relative m/z difference in ppm
abs(th - ac)/th * 1e6
#> [1] 28.91143

So, we will work with an an m/z tolerance of 30ppm (which seems rather high for a high resolution mass spectrometer).

A 30ppm mass accuracy necessitates an m/z tolerance of 60ppm, because deviations can go both ways:

oillist <- mergeMS2spectra(raw_oillist, 
                                peaktable = mtbls_df[,1:3],
                                exclude_unmatched = TRUE,
                                rt_tolerance = 60,
                                mz_tolerance = 3e-5)

Add annotations

To add annotations, we use mtbls_df as well, as described in the General Tutorial:

fl <- featureList(oillist)
fl_annos <- dplyr::left_join(fl, mtbls_df, by = "id")

annolist <- addAnnotations(oillist, fl_annos, annotationColumn = 6)

Generate distance matrix

For the generation of the distance matrix, too, we use an m/z tolerance of 30ppm:

distmat <- distanceMatrix(annolist, mz_tolerance = 3e-5)

Explore data

To explore the data, we have a look at a cluster dendrogram:

HCplot(distmat, h = 0.7, cex = 0.7)

Figure 1: Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on MS2 spectra similarities of the MTBLS433 LC-MS/MS example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs. Distance from the central point is indicative of the height of the dendrogram.

Since it was not in the focus of their study, the authors identified only a few metabolites. If we look at the positions of these metabolites in the cluster dendrogram, we see that the poly-unsaturated fatty acids alpha-linolenic acid and alpha-linolenic acid are nicely separated from the saturated fatty acids stearic acid and arachidic acid. We would expect the latter to cluster together but a look at the spectra reveals that stearic acid barely produces any fragment ions and mainly contains the unfragmented [M-H]- parent ion:

specplot(getSpectrum(annolist, "annotation", "Stearic acid"))

Figure 2: Barplot for the feature M283.2642T14.62, identified as stearic acid, displaying fragment m/z on the x-axis and intensity normalised to the maximum intensity on the y-axis.

In contrast, arachidic acid produces a much richer spectrum:

specplot(getSpectrum(annolist, "annotation", "Arachidic acid"))

Figure 3: Barplot for the feature M311.3046T15.1, identified as arachidic acid, displaying fragment m/z on the x-axis and intensity normalised to the maximum intensity on the y-axis.

Inspecting the features that cluster close to arachidic acid shows that many of them have an exact m/z that conforms with other fatty acids of different chain length or saturation (within the m/z tolerance), e.g. the neighbouring feature M339.2125T15.32 that could be arachidonic acid [M+Cl]-.

Looking at oleic acid [M-H]-, we see that it clusters very closely to M563.5254T13.93, whose m/z is consistent with oleic acid [2M-H]- and some other possible adducts.

As a last example, the only identified metabolite that does not belong to the class of fatty acids is acetosyringone, a phenolic secondary plant metabolite. It forms part of a rather dense cluster in the dendrogram, suggesting high spectral similarities to the other members of the cluster. It would be interesting to try to annotate more of these metabolite to find out if they are also phenolic compounds.

In conclusion, we demonstrated how to use CluMSID with a publicly available data set from the MetaboLights repository and how to include external information such as peak lists or feature annotations into a CluMSID workflow. In doing so, we had a look on a few example findings that could help to annotate more of the features in the data set and thereby showed the usefulness of CluMSID for samples very different from the ones in the other tutorials.

Session Info

#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/ 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> other attached packages:
#>  [1] magrittr_2.0.3      metaMSdata_1.39.0   metaMS_1.41.0      
#>  [4] CAMERA_1.61.0       xcms_4.3.0          BiocParallel_1.39.0
#>  [7] Biobase_2.65.0      BiocGenerics_0.51.0 CluMSIDdata_1.19.0 
#> [10] CluMSID_1.21.0     
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.16.0          
#>   [3] jsonlite_1.8.8              MultiAssayExperiment_1.31.0
#>   [5] farver_2.1.1                MALDIquant_1.22.2          
#>   [7] rmarkdown_2.26              fs_1.6.4                   
#>   [9] zlibbioc_1.51.0             vctrs_0.6.5                
#>  [11] base64enc_0.1-3             htmltools_0.5.8.1          
#>  [13] S4Arrays_1.5.0              progress_1.2.3             
#>  [15] Formula_1.2-5               SparseArray_1.5.0          
#>  [17] mzID_1.43.0                 sass_0.4.9                 
#>  [19] KernSmooth_2.23-22          bslib_0.7.0                
#>  [21] htmlwidgets_1.6.4           plyr_1.8.9                 
#>  [23] impute_1.79.0               plotly_4.10.4              
#>  [25] cachem_1.0.8                igraph_2.0.3               
#>  [27] lifecycle_1.0.4             iterators_1.0.14           
#>  [29] pkgconfig_2.0.3             Matrix_1.7-0               
#>  [31] R6_2.5.1                    fastmap_1.1.1              
#>  [33] GenomeInfoDbData_1.2.12     MatrixGenerics_1.17.0      
#>  [35] clue_0.3-65                 digest_0.6.35              
#>  [37] pcaMethods_1.97.0           colorspace_2.1-0           
#>  [39] GGally_2.2.1                S4Vectors_0.43.0           
#>  [41] Hmisc_5.1-2                 GenomicRanges_1.57.0       
#>  [43] Spectra_1.15.0              fansi_1.0.6                
#>  [45] httr_1.4.7                  abind_1.4-5                
#>  [47] compiler_4.4.0              bit64_4.0.5                
#>  [49] withr_3.0.0                 doParallel_1.0.17          
#>  [51] backports_1.4.1             htmlTable_2.4.2            
#>  [53] DBI_1.2.2                   ggstats_0.6.0              
#>  [55] highr_0.10                  gplots_3.1.3.1             
#>  [57] MASS_7.3-60.2               MsExperiment_1.7.0         
#>  [59] DelayedArray_0.31.0         gtools_3.9.5               
#>  [61] caTools_1.18.2              mzR_2.39.0                 
#>  [63] tools_4.4.0                 foreign_0.8-86             
#>  [65] PSMatch_1.9.0               ape_5.8                    
#>  [67] nnet_7.3-19                 glue_1.7.0                 
#>  [69] dbscan_1.1-12               nlme_3.1-164               
#>  [71] QFeatures_1.15.0            grid_4.4.0                 
#>  [73] checkmate_2.3.1             cluster_2.1.6              
#>  [75] generics_0.1.3              gtable_0.3.5               
#>  [77] tzdb_0.4.0                  preprocessCore_1.67.0      
#>  [79] tidyr_1.3.1                 sna_2.7-2                  
#>  [81] data.table_1.15.4           hms_1.1.3                  
#>  [83] MetaboCoreUtils_1.13.0      utf8_1.2.4                 
#>  [85] XVector_0.45.0              stringr_1.5.1              
#>  [87] foreach_1.5.2               pillar_1.9.0               
#>  [89] vroom_1.6.5                 limma_3.61.0               
#>  [91] robustbase_0.99-2           dplyr_1.1.4                
#>  [93] lattice_0.22-6              bit_4.0.5                  
#>  [95] tidyselect_1.2.1            RBGL_1.81.0                
#>  [97] knitr_1.46                  gridExtra_2.3              
#>  [99] IRanges_2.39.0              ProtGenerics_1.37.0        
#> [101] SummarizedExperiment_1.35.0 stats4_4.4.0               
#> [103] xfun_0.43                   statmod_1.5.0              
#> [105] MSnbase_2.31.0              matrixStats_1.3.0          
#> [107] DEoptimR_1.1-3              stringi_1.8.3              
#> [109] UCSC.utils_1.1.0            statnet.common_4.9.0       
#> [111] lazyeval_0.2.2              yaml_2.3.8                 
#> [113] evaluate_0.23               codetools_0.2-20           
#> [115] archive_1.1.8               MsCoreUtils_1.17.0         
#> [117] tibble_3.2.1                BiocManager_1.30.22        
#> [119] graph_1.83.0                cli_3.6.2                  
#> [121] affyio_1.75.0               rpart_4.1.23               
#> [123] munsell_0.5.1               jquerylib_0.1.4            
#> [125] network_1.18.2              Rcpp_1.0.12                
#> [127] GenomeInfoDb_1.41.0         MassSpecWavelet_1.71.0     
#> [129] coda_0.19-4.1               XML_3.99-0.16.1            
#> [131] parallel_4.4.0              readr_2.1.5                
#> [133] ggplot2_3.5.1               prettyunits_1.2.0          
#> [135] AnnotationFilter_1.29.0     bitops_1.0-7               
#> [137] viridisLite_0.4.2           MsFeatures_1.13.0          
#> [139] scales_1.3.0                affy_1.83.0                
#> [141] ncdf4_1.22                  purrr_1.0.2                
#> [143] crayon_1.5.2                rlang_1.1.3                
#> [145] vsn_3.73.0