Metabolomics Workbench hosts a metabolomics data repository. It contains over 1000 publicly available studies, including raw data, processed data and metabolite/compound information.
The repository is searchable using a REST service API. The metabolomicsWorkbenchR package makes the endpoints of this service available in R and provides functionality to search the database and import datasets and metabolite information into commonly used formats such as data frames and SummarizedExperiment objects.
In this vignette we will use metabolomicsWorkbenchR to retrieve the uploaded peak matrix for a study. We will then use structToolbox to apply a basic workflow to analyse the data.
To install this package enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metabolomicsWorkbenchR")
For older versions, please refer to the appropriate Bioconductor release.
The API endpoints for Metabolomics Workbench are accessible using the do_query function, which takes four inputs:
context A valid context name (character)
input_item A valid input_item name (character)
input_value A valid input_value name (character)
output_item A valid output_item (character)
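As a rough illustration of how these four inputs relate to the underlying REST service, the sketch below assembles a request URL in base R. The helper name build_query_url is hypothetical, and the base URL and path order are assumptions taken from the public Metabolomics Workbench REST documentation rather than from the package itself; do_query may construct its requests differently.

```r
# Hypothetical helper (not part of metabolomicsWorkbenchR): assemble the
# REST URL for one query. The base URL and the order of the path segments
# are assumptions based on the public REST documentation.
build_query_url <- function(context, input_item, input_value, output_item,
                            base = "https://www.metabolomicsworkbench.org/rest") {
  parts <- c(context, input_item, input_value, output_item)
  stopifnot(is.character(parts), length(parts) == 4L)
  # URL-encode each piece so spaces or slashes in a value stay in one segment
  encoded <- vapply(parts, utils::URLencode, character(1), reserved = TRUE)
  paste(c(base, encoded), collapse = "/")
}

url <- build_query_url("study", "study_id", "ST000010", "summary")
print(url)
```

The same four values, passed in the same order, are what do_query expects as context, input_item, input_value and output_item.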
Contexts refer to the different database searches available in the API. The reader is referred to the API manual for details of each context.
In metabolomicsWorkbenchR, contexts are stored as a list, and the names of the valid contexts can be listed from it:
##  "study" "compound" "refmet" "gene" "protein" "moverz"
##  "exactmass"
Each input_item is specific to a context. The valid input and output items can be listed for any context; for the "study" context they are:
## Valid inputs:
##  "study_id" "study_title" "institute" "last_name"
##  "analysis_id" "metabolite_id"
##
## Valid outputs:
##  "summary" "factors"
##  "analysis" "metabolites"
##  "mwtab" "source"
##  "species" "disease"
##  "number_of_metabolites" "data"
##  "datatable" "untarg_studies"
##  "untarg_factors" "untarg_data"
##  "metabolite_info" "SummarizedExperiment"
##  "untarg_SummarizedExperiment" "DatasetExperiment"
##  "untarg_DatasetExperiment"
First we query the database to return a list of untargeted studies. We use the “study” context in combination with a special case input item called “ignored” that is required for the “untarg_studies” output item.
US = do_query(
    context = 'study',
    input_item = 'ignored',
    input_value = 'ignored',
    output_item = 'untarg_studies'
)
head(US[,1:3])
##   study_id analysis_id                         analysis_display
## 1 ST000009    AN000023 LC/Electro-spray /QTOF positive ion mode
## 2 ST000009    AN000024 LC/Electro-spray /QTOF negative ion mode
## 3 ST000010    AN000025 LC/Electro-spray /QTOF positive ion mode
## 4 ST000010    AN000026 LC/Electro-spray /QTOF negative ion mode
## 5 ST000045    AN000072                 MS positive ion mode/C18
## 6 ST000045    AN000073               MS positive ion mode/HILIC
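Because each study can have several analyses, it can help to look up the analysis IDs for one study before requesting data. The following is a minimal base R sketch; the data frame US_example is a hand-typed copy of the six rows printed above (an assumption made so the example runs offline), and in practice the US object returned by do_query would be subset in the same way.

```r
# Hand-typed copy of the rows printed above, so the example runs offline.
# In a live session, subset the `US` data frame returned by do_query().
US_example <- data.frame(
  study_id = c("ST000009", "ST000009", "ST000010",
               "ST000010", "ST000045", "ST000045"),
  analysis_id = c("AN000023", "AN000024", "AN000025",
                  "AN000026", "AN000072", "AN000073"),
  analysis_display = c(
    "LC/Electro-spray /QTOF positive ion mode",
    "LC/Electro-spray /QTOF negative ion mode",
    "LC/Electro-spray /QTOF positive ion mode",
    "LC/Electro-spray /QTOF negative ion mode",
    "MS positive ion mode/C18",
    "MS positive ion mode/HILIC"
  ),
  stringsAsFactors = FALSE
)

# Keep only the analyses that belong to study ST000010
st10 <- US_example[US_example$study_id == "ST000010", ]
print(st10[, c("analysis_id", "analysis_display")])
```

For study ST000010 this leaves two analyses, AN000025 (positive ion mode) and AN000026 (negative ion mode).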
We will pull data for study “ST000010”. We can obtain summary information using the “summary” output item.
S = do_query('study','study_id','ST000010','summary')
t(S)
##                 [,1]
## study_id        "ST000010"
## study_title     "Lung Cancer Cells 4"
## study_type      "MS analysis (Untargeted)"
## institute       "University of Michigan"
## last_name       "Keshamouni"
## first_name      "Venkat"
## email           "email@example.com"
## submit_date     "2013-04-03"
## study_summary   "In cancer cells, the process of epithelial–mesenchymal transition (EMT) confers migratory and invasive capacity, resistance to apoptosis, drug resistance, evasion of host immune surveillance and tumor stem cell traits. Cells undergoing EMT may represent tumor cells with metastatic potential. Characterizing the EMT secretome may identify biomarkers to monitor EMT in tumor progression and provide a prognostic signature to predict patient survival. Utilizing a transforming growth factor-β-induced cell culture model of EMT, we quantitatively profiled differentially secreted proteins, by GeLC-tandem mass spectrometry. Integrating with the corresponding transcriptome, we derived an EMT-associated secretory phenotype (EASP) comprising of proteins that were differentially upregulated both at protein and mRNA levels. Four independent primary tumor-derived gene expression data sets of lung cancers were used for survival analysis by the random survival forests (RSF) method. Analysis of 97-gene EASP expression in human lung adenocarcinoma tumors revealed strong positive correlations with lymph node metastasis, advanced tumor stage and histological grade. RSF analysis built on a training set (n = 442), including age, sex and stage as variables, stratified three independent lung cancer data sets into low-, medium- and high-risk groups with significant differences in overall survival. We further refined EASP to a 20 gene signature (rEASP) based on variable importance scores from RSF analysis. Similar to EASP, rEASP predicted survival of both adenocarcinoma and squamous carcinoma patients. More importantly, it predicted survival in the early-stage cancers. These results demonstrate that integrative analysis of the critical biological process of EMT provides mechanism-based and clinically relevant biomarkers with significant prognostic value.\nResearch is published, core data not used but project description is relevant:\nhttp://www.jimmunol.org/content/194/12/5789.long\n"
## subject_species "Homo sapiens"
## department      NA
## phone           NA
As there are multiple datasets per study, untargeted data needs to be requested by analysis ID. We will request the DatasetExperiment format so that we can use the data directly with structToolbox.
DE = do_query(
    context = 'study',
    input_item = 'analysis_id',
    input_value = 'AN000025',
    output_item = 'untarg_DatasetExperiment'
)
DE
## A "DatasetExperiment" object
## ----------------------------
## name:
## description:
## data:          39 rows x 3569 columns
## sample_meta:   39 rows x 4 columns
## variable_meta: 3569 rows x 1 columns
Now we construct a minimal metabolomics workflow consisting of quality filtering, normalisation, imputation and scaling before applying PCA.
# model sequence
M = mv_feature_filter(
        threshold = 40,
        method = 'across',
        factor_name = 'FCS') +
    mv_sample_filter(mv_threshold = 40) +
    vec_norm() +
    knn_impute() +
    log_transform() +
    mean_centre() +
    PCA()

# apply model
M = model_apply(M,DE)
## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 198 rows with more than 50 % entries missing;
## mean imputation used for these rows
# pca scores plot
C = pca_scores_plot(factor_name=c('FCS'))
chart_plot(C,M[length(M)])
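To make the individual steps of the workflow more concrete, the sketch below walks through roughly equivalent operations in base R on a small toy matrix. It is not the structToolbox implementation. The toy data, the use of Euclidean length for vector normalisation, the base-10 logarithm and the column-mean imputation (a stand-in for kNN imputation) are all assumptions made for illustration.

```r
set.seed(1)
# Toy peak matrix: 10 samples (rows) x 6 features (columns), all positive
X <- matrix(runif(60, min = 100, max = 1000), nrow = 10)
X[cbind(c(1, 3, 5, 7), c(1, 2, 3, 4))] <- NA  # a few isolated missing values
X[1:6, 6] <- NA                               # feature 6 is 60% missing

# 1. Feature filter: drop features with more than 40% missing values
X <- X[, colMeans(is.na(X)) <= 0.40, drop = FALSE]

# 2. Sample filter: drop samples with more than 40% missing values
X <- X[rowMeans(is.na(X)) <= 0.40, , drop = FALSE]

# 3. Vector normalisation (assumed form): scale each sample to unit length
X <- X / sqrt(rowSums(X^2, na.rm = TRUE))

# 4. Imputation: column means, used here as a simple stand-in for kNN
for (j in seq_len(ncol(X))) {
  X[is.na(X[, j]), j] <- mean(X[, j], na.rm = TRUE)
}

# 5. Log transform (base 10 assumed) and 6. mean centring of each feature
X <- scale(log10(X), center = TRUE, scale = FALSE)

# 7. PCA on the already-centred data; scores have one row per sample
pca <- prcomp(X, center = FALSE)
scores <- pca$x
print(dim(scores))
```

Chaining the steps as a model sequence, as in the workflow above, has the advantage that the same fitted sequence can be stored, inspected and reapplied, whereas this sketch overwrites X at every step.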