Omics technologies have advanced precision medicine by enabling the prediction of clinically relevant outcomes based on individual molecular profiles. However, their diagnostic and prognostic applications remain limited due to variability introduced by different technological platforms, laboratory protocols, and data processing methods. These inconsistencies hinder the generalizability of machine learning (ML) models trained on individual datasets.
Single-sample molecular scoring has emerged as a promising solution to improve interstudy reproducibility. By summarizing molecular features (e.g., gene expression) into pathway or gene set activity scores, these methods provide a more standardized and robust representation of biological processes. This enables the development of ML models that generalize across datasets, even when quantification methods differ.
Several R packages implement molecular scoring methods, such as
decoupleR
((???)) and GSVA
((???)). pathMED
unifies
and simplifies these approaches, providing a common input format and
allowing users to compute multiple scores efficiently. Additionally,
pathMED
includes our novel method, M-scores, which enhances disease
characterization and prediction by leveraging reference datasets
((???)).
A key limitation of pathway-based scoring is the heterogeneity within gene sets, where different genes may have opposing effects on biological processes. Direct scoring of such broad gene sets can obscure meaningful patterns. To address this, pathMED incorporates methods to refine gene sets by clustering features with coordinated activity, improving pathway granularity and revealing previously hidden subpathways.
Many ML-based clinical prediction models also suffer from overfitting, unreliable performance metrics, and limited reproducibility in external datasets, restricting their clinical applicability. To overcome these challenges, pathMED provides a unified framework for ML model training, validation, and prediction. While it leverages the caret package, pathMED simplifies its usage and implements nested cross-validation by default to enhance model reliability.
You can install pathMED from the GitHub repository:
devtools::install_github("jordimartorell/pathMED")
Please note that on some operating systems such as Ubuntu there may be installation errors in the packages “metrica”, “factoextra” and “FactoMineR”, as it is necessary to install system dependencies with administrator permissions. These system dependencies can be identified the following code (adapted to Ubuntu 20.04) and installed by running the output in a terminal as administrator:
if (!require("pak")) {
install.packages("pak"))
}
pak::pkg_sysreqs(
c("metrica", "factoextra", "FactoMineR"),
"ubuntu", "20.04"
)
Gene sets often contain genes associated with a specific biological
function, but their expression can be heterogeneous, sometimes even
exhibiting opposing directions when the function is altered (e.g., in
disease). The dissectDB()
function clusters the genes within each set
into subsets based on coordinated expression patterns. This clustering
can be performed using one or more molecular datasets, which may include
control samples or not. The same dataset used in the following sections
can also be used for this purpose. Both preloaded and custom gene sets
can be used (see the next section for details). The following code
splits KEGG pathways into subpathways using the example dataset:
library(pathMED)
data(pathMEDExampleData)
customKEGG <- dissectDB(list(pathMEDExampleData), geneSets = "kegg")
The output is a gene set that can be used with the getScores()
function. In the new gene set, subpathways are labeled with the suffix
_splitx. For example, for the pathway hsa04714:
# Before splitting
data(genesetsData)
print(head(genesetsData[["kegg"]][["hsa04714"]]))
## [1] "COX6A2" "COX7A1" "COX15" "GRB2" "HRAS" "MAP3K5"
# After splitting
print(customKEGG[grep("hsa04714", names(customKEGG))])
## $hsa04714.split1
## [1] "NDUFA4" "MGLL" "NDUFAF3" "NDUFB2"
##
## $hsa04714.split2
## [1] "NDUFS4" "UQCRQ" "NDUFA11" "SMARCD1" "NDUFA6" "NDUFA3" "COX16"
##
## $hsa04714.split3
## [1] "COX20" "RPS6" "NDUFB6" "NDUFC2" "UQCRC2" "ACTL6A" "UQCRH"
##
## $hsa04714.split4
## [1] "MLST8" "NDUFA2" "NDUFV1" "NDUFS8" "CPT1B" "NDUFS7"
Only the features available in the input dataset are included in this object.
The getScores()
function calculates a score for each gene set and
sample. It takes as input a matrix or data frame with molecular features
in rows and samples in columns, along with a gene set collection. The
function supports 20 different scoring methods, which can be specified
using the method
parameter. The available methods and their references
are:
Gene sets can be provided as a named list, where each entry contains the
genes associated with a specific gene set, using the geneSets
parameter. Alternatively, preloaded databases can be used by specifying
their name in the geneSets
parameter.
Note that these preloaded gene sets are built with gene symbols. Therefore, the row names of the input data should be gene symbols to use them.
If running the script on a multi-core processor, you can use the cores
parameter to specify the number of workers (most pathMED
functions
include this parameter). Additionally, specific parameters for the
scoring functions can be customized (see, for example,
?GSVA::gsvaParam
).
To run this function with the example data:
library(pathMED)
data(pathMEDExampleData)
scoresExample <- getScores(pathMEDExampleData, geneSets = "kegg",
method = "Z-score")
print(scoresExample[1:5, 1:5])
## 2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## hsa00010 0.6615978 0.3752335 -0.8888598 1.87367072 1.1322258
## hsa00020 -3.0246505 -2.4858390 0.9356097 0.75894566 1.3966702
## hsa00030 -2.4434677 1.3570865 -2.6715807 0.06263129 -0.2135047
## hsa00040 -0.3852843 -0.4486775 -0.4217614 0.51468743 0.6125073
## hsa00051 -0.3852843 -0.4486775 -0.4217614 0.51468743 0.6125073
The result is a matrix with gene set identifiers as rows and samples as
columns. To annotate the identifiers with their corresponding
descriptions, you can use the ann2term()
function.
annotatedPathways <- ann2term(scoresExample)
head(annotatedPathways)
## ID term
## 1 hsa00010 Glycolysis / Gluconeogenesis
## 2 hsa00020 Citrate cycle (TCA cycle)
## 3 hsa00030 Pentose phosphate pathway
## 4 hsa00040 Pentose and glucuronate interconversions
## 5 hsa00051 Fructose and mannose metabolism
## 6 hsa00052 Galactose metabolism
The next step in the core pathMED
pipeline is to use the scores
calculated in the previous step to train machine learning (ML) models
for predicting categorical or numerical clinical variables. The
trainModel()
function fits and tests multiple algorithms, selecting
the one with the best performance for predicting a variable. The input
objects are inputData
(the scores matrix obtained in the previous
step, or any other numerical matrix), and metadata
(a data frame
containing sample metadata). The var2predict
parameter indicates the
column name of metadata with the variable to be predicted.
The function methodsML()
should be used to prepare a list of ML
algorithms to be trained and tested. One or more algorithms can be
selected using the algorithms
parameter (see the function help for
details), and the amount of hyperparameters combinations to be tested is
specified with the tuneLength
parameter.
Nested cross-validation is performed using Koutter
folds for the outer
cross-validation (CV) and Kinner
folds for the inner CV. The inner
k-fold CV can be repeated multiple times (repeatsCV
parameter) for
hyperparameter tuning.
In the example dataset, the metadata include a column called Response
,
indicating whether patients responded to a treatment. The
positiveClass
parameter can be specified to calculate performance
metrics based on the chosen positive class (e.g., diseased).
data(pathMEDExampleMetadata)
modelsList <- methodsML(algorithms = c("rf", "knn"), outcomeClass = "character")
set.seed(123)
trainedModel <- trainModel(scoresExample,
metadata = pathMEDExampleMetadata,
var2predict = "Response",
positiveClass = "YES",
models = modelsList,
Koutter = 2,
Kinner = 2,
repeatsCV = 1
)
##
|
| | 0%
|
|================== | 25%
|
|=================================== | 50%
|
|==================================================== | 75%
|
|======================================================================| 100%
print(trainedModel)
## $model
## Random Forest
##
## 40 samples
## 310 predictors
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: None
##
## $stats
## rf knn
## mcc 0.3115628 0.1483202
## balacc 0.6446970 0.5696970
## accuracy 0.6466165 0.5751880
## recall 0.4393939 0.4393939
## specificity 0.8500000 0.7000000
## npv 0.6071429 0.5584416
## precision 0.7285714 0.6000000
## fscore 0.5476190 0.5000000
## perc.lossSamples 0.0000000 0.0000000
##
## $bestTune
## .mtry
## 1 2.5
##
## $subsample.preds
## NO YES obs
## bd9262b7_91 0.472 0.528 NO
## 564c1b1a_143 0.524 0.476 NO
## 55d3e499_0 0.574 0.426 NO
## 564c1b1a_124 0.572 0.428 NO
## 55d3e499_52 0.554 0.446 NO
## 2f13c769_0 0.526 0.474 NO
## 96bf8031_0 0.650 0.350 NO
## 3a42e820_170 0.582 0.418 NO
## 3a42e820_79 0.654 0.346 NO
## 2270bbfa_91 0.702 0.298 NO
## b60c5f9a_430 0.668 0.332 YES
## 3daec0f2_219 0.476 0.524 YES
## 3daec0f2_295 0.320 0.680 YES
## dc81f201_0 0.562 0.438 YES
## a1b702ca_382 0.376 0.624 YES
## 616d027e_0 0.486 0.514 YES
## 3daec0f2_384 0.344 0.656 YES
## b60c5f9a_625 0.562 0.438 YES
## dc81f201_157 0.466 0.534 YES
## 601fe1df_4 0.536 0.464 YES
## b60c5f9a_535 0.610 0.390 YES
## 2f13c769_68 0.524 0.476 NO
## 2f13c769_109 0.398 0.602 NO
## 564c1b1a_52 0.596 0.404 NO
## 3a42e820_435 0.550 0.450 NO
## 49eb5d3c_163 0.656 0.344 NO
## 96bf8031_96 0.680 0.320 NO
## bd9262b7_0 0.586 0.414 NO
## 3a42e820_560 0.494 0.506 NO
## 86a2a8eb_0 0.500 0.500 NO
## 2270bbfa_0 0.748 0.252 NO
## 3daec0f2_126 0.386 0.614 YES
## 3daec0f2_35 0.446 0.554 YES
## a1b702ca_284 0.432 0.568 YES
## 616d027e_672 0.574 0.426 YES
## a1b702ca_27 0.596 0.404 YES
## 601fe1df_0 0.630 0.370 YES
## dc81f201_38 0.524 0.476 YES
## b60c5f9a_89 0.540 0.460 YES
## a1b702ca_0 0.556 0.444 YES
The output object is a list with the following elements:
This object may be used with the predictExternal()
function to make
predictions on new data with the same features than the training data.
data(refData)
scoresExternal <- getScores(refData$dataset1,
geneSets = "kegg",
method = "Z-score"
)
predictions <- predictExternal(scoresExternal, trainedModel)
head(predictions)
## Obs Prediction
## 1 GSM1594346 YES
## 2 GSM1594449 NO
## 3 GSM1594572 YES
## 4 GSM1594656 NO
## 5 GSM1594994 YES
## 6 GSM1594605 NO
In addition to previously developed molecular scoring methods, pathMED includes the novel M-score method. M-scores were used in a previous study that used personalized scoring to predict drug response, symptoms and disease progression in Systemic Lupus Erythematosus patients based on gene expression data (???).
M-scores are calculated by comparing pathway activities between cases
and healthy controls, requiring the availability of control sample data.
To use M-scores with the getScores()
function, the labels
parameter
must be used to indicate control samples (using 0 or “Healthy”).
As described previously (???), M-scores can be used to identify key pathways associated with the studied disease. To achieve this, a reference dataset of patients and healthy controls must be constructed, either from a single dataset or by integrating multiple datasets, following the steps outlined below. This reference will also enable the imputation of M-scores in new cohorts or individual patients where healthy controls are not available.
We will load refData
into the environment, which includes four
log2-normalized gene expression datasets and four corresponding metadata
datasets containing clinical information. These datasets were extracted
from NCBI GEO and contain both lupus and healthy samples, encoded in the
metadata as “Healthy_sample” and “Disease_sample”.
The buildRefObject()
function can be used to create a refObject
object with the required structure for the following steps. The function
require:
data
) or a single dataset if only
one is available.groupVar
).controlGroup
), typically representing
healthy samples. If different datasets use different control labels,
a separate value should be provided for each metadata file.data(refData)
refObject <- buildRefObject(
data = list(
refData$dataset1, refData$dataset2,
refData$dataset3, refData$dataset4
),
metadata = list(
refData$metadata1, refData$metadata2,
refData$metadata3, refData$metadata4
),
groupVar = "group",
controlGroup = "Healthy_sample"
)
To create the refObject
object, it is recommended to include as many
datasets as possible to improve robustness. Ensure that all gene
expression datasets are log2 transformed and follow a normal
distribution. Additionally, gene names must be annotated as Gene Symbols
in both reference and test datasets.
After creating a refObject
object, the next step is to calculate
M-scores for the reference datasets with the mScores_createReference()
function.
refMscores <- mScores_createReference(refObject,
geneSets = "tmod",
cores = 1
)
To identify the most relevant gene sets for the studied phenotype, you
can use the mScores_filterPaths()
function. This function selects
pathways based on two key parameters: perc_samples
, which defines the
percentage of samples in which a pathway must be statistically
significant, and min_datasets
, which sets the minimum number of
datasets that must meet this th
These parameters help identify pathways that may be significant in small patient subgroups—a common scenario in heterogeneous diseases—while also ensuring that the selected pathways appear deregulated across multiple studies, reducing the risk of study-specific biases. The significance of each pathway in each sample is determined using its M-score, as described in the original study (???).
relevantPaths <- mScores_filterPaths(
MRef = refMscores,
min_datasets = 3,
perc_samples = 10
)
## Selected paths: 168
Once the reference data is prepared and the relevant gene sets are
selected, you can calculate M-scores for datasets without healthy
controls using the mScores_imputeFromReference()
function.
The function requires a matrix containing case samples, gene sets (or
the output of the mScores_filterPaths()
function), and the reference
data generated by mScores_createReference()
. M-scores are computed
using the k most similar samples from the reference data, with k
specified through the nk
parameter. The distance.threshold
parameter
defines the maximum allowable Euclidean distance between a sample and
the reference for M-score imputation.
The output is a list containing a matrix of M-scores, where gene sets are in rows and samples in columns, along with a dataframe of calculated distances between each sample and the reference.
mScoresExample <- mScores_imputeFromReference(
inputData = pathMEDExampleData,
geneSets = relevantPaths,
externalReference = refMscores,
distance.threshold = 50
)
## Calculating M-Scores by similarity with samples from external reference for 40 patients
print(mScoresExample$Mscores[1:5, 1:5])
## 2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## LI.M0 4.667796657 0.3831264 -0.7573253 -0.1096919 -1.1079187
## LI.M2.1 0.008238995 -0.2117298 -0.3545561 -0.3799828 -0.3549857
## LI.M4.1 1.559523495 1.3094181 1.2007531 1.2003028 1.6567628
## LI.M4.2 1.722801405 1.2468369 1.1522729 1.1678329 1.5024763
## LI.M4.4 0.496440873 0.5941196 0.8777018 0.9477592 1.2017842
print(mScoresExample$Distances[1:5, ])
## ID Distance
## 1 2f13c769_68 47.60733
## 2 bd9262b7_91 47.62470
## 3 564c1b1a_143 47.61592
## 4 55d3e499_0 47.69350
## 5 2f13c769_109 47.51437
## R Under development (unstable) (2025-03-13 r87965)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_7.0-1 lattice_0.22-7 ggplot2_3.5.1 pathMED_0.99.2
## [5] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 magrittr_2.0.3
## [3] magick_2.8.6 TH.data_1.1-3
## [5] estimability_1.5.1 farver_2.1.2
## [7] rmarkdown_2.29 minerva_1.5.10
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] tinytex_0.56 energy_1.7-12
## [13] htmltools_0.5.8.1 S4Arrays_1.7.3
## [15] xgboost_1.7.9.1 Rhdf5lib_1.29.2
## [17] rhdf5_2.51.2 SparseArray_1.7.7
## [19] pROC_1.18.5 sass_0.4.9
## [21] parallelly_1.43.0 bslib_0.9.0
## [23] htmlwidgets_1.6.4 plyr_1.8.9
## [25] sandwich_3.1-1 emmeans_1.11.0
## [27] zoo_1.8-13 lubridate_1.9.4
## [29] cachem_1.1.0 gam_1.22-5
## [31] lifecycle_1.0.4 iterators_1.0.14
## [33] pkgconfig_2.0.3 rsvd_1.0.5
## [35] Matrix_1.7-3 R6_2.6.1
## [37] fastmap_1.2.0 GenomeInfoDbData_1.2.14
## [39] MatrixGenerics_1.19.1 future_1.34.0
## [41] digest_0.6.37 colorspace_2.1-1
## [43] patchwork_1.3.0 AnnotationDbi_1.69.1
## [45] S4Vectors_0.45.4 irlba_2.3.5.1
## [47] GenomicRanges_1.59.1 RSQLite_2.3.9
## [49] beachmat_2.23.7 randomForest_4.7-1.2
## [51] timechange_0.3.0 httr_1.4.7
## [53] abind_1.4-8 compiler_4.6.0
## [55] proxy_0.4-27 gsl_2.1-8
## [57] bit64_4.6.0-1 withr_3.0.2
## [59] BiocParallel_1.41.5 metrica_2.1.0
## [61] DBI_1.2.3 HDF5Array_1.35.16
## [63] MASS_7.3-65 lava_1.8.1
## [65] DelayedArray_0.33.6 rjson_0.2.23
## [67] scatterplot3d_0.3-44 flashClust_1.01-2
## [69] ModelMetrics_1.2.2.2 tools_4.6.0
## [71] future.apply_1.11.3 FactoMineR_2.11
## [73] nnet_7.3-20 glue_1.8.0
## [75] h5mread_0.99.4 rhdf5filters_1.19.2
## [77] nlme_3.1-168 grid_4.6.0
## [79] cluster_2.1.8.1 reshape2_1.4.4
## [81] generics_0.1.3 recipes_1.2.1
## [83] gtable_0.3.6 class_7.3-23
## [85] tidyr_1.3.1 data.table_1.17.0
## [87] BiocSingular_1.23.0 ScaledMatrix_1.15.0
## [89] XVector_0.47.2 BiocGenerics_0.53.6
## [91] ggrepel_0.9.6 foreach_1.5.2
## [93] pillar_1.10.2 stringr_1.5.1
## [95] GSVA_2.1.12 splines_4.6.0
## [97] dplyr_1.1.4 survival_3.8-3
## [99] bit_4.6.0 caretEnsemble_4.0.1
## [101] annotate_1.85.0 tidyselect_1.2.1
## [103] SingleCellExperiment_1.29.2 pbapply_1.7-2
## [105] Biostrings_2.75.4 knitr_1.50
## [107] bookdown_0.42 IRanges_2.41.3
## [109] SummarizedExperiment_1.37.0 stats4_4.6.0
## [111] xfun_0.52 Biobase_2.67.0
## [113] hardhat_1.4.1 factoextra_1.0.7
## [115] timeDate_4041.110 matrixStats_1.5.0
## [117] DT_0.33 stringi_1.8.7
## [119] UCSC.utils_1.3.1 boot_1.3-31
## [121] yaml_2.3.10 evaluate_1.0.3
## [123] codetools_0.2-20 tibble_3.2.1
## [125] graph_1.85.3 BiocManager_1.30.25
## [127] multcompView_0.1-10 cli_3.6.4
## [129] rpart_4.1.24 xtable_1.8-4
## [131] munsell_0.5.1 jquerylib_0.1.4
## [133] Rcpp_1.0.14 GenomeInfoDb_1.43.4
## [135] globals_0.16.3 coda_0.19-4.1
## [137] png_0.1-8 XML_3.99-0.18
## [139] parallel_4.6.0 leaps_3.2
## [141] gower_1.0.2 blob_1.2.4
## [143] sparseMatrixStats_1.19.0 SpatialExperiment_1.17.0
## [145] listenv_0.9.1 mvtnorm_1.3-3
## [147] GSEABase_1.69.1 ipred_0.9-15
## [149] e1071_1.7-16 scales_1.3.0
## [151] prodlim_2024.06.25 purrr_1.0.4
## [153] crayon_1.5.3 ada_2.0-5
## [155] rlang_1.1.5 KEGGREST_1.47.1
## [157] multcomp_1.4-28