Contents

1 Introduction

Omics technologies have advanced precision medicine by enabling the prediction of clinically relevant outcomes based on individual molecular profiles. However, their diagnostic and prognostic applications remain limited due to variability introduced by different technological platforms, laboratory protocols, and data processing methods. These inconsistencies hinder the generalizability of machine learning (ML) models trained on individual datasets.

Single-sample molecular scoring has emerged as a promising solution to improve interstudy reproducibility. By summarizing molecular features (e.g., gene expression) into pathway or gene set activity scores, these methods provide a more standardized and robust representation of biological processes. This enables the development of ML models that generalize across datasets, even when quantification methods differ.

Several R packages implement molecular scoring methods, such as decoupleR ((???)) and GSVA ((???)). pathMED unifies and simplifies these approaches, providing a common input format and allowing users to compute multiple scores efficiently. Additionally, pathMED includes our novel method, M-scores, which enhances disease characterization and prediction by leveraging reference datasets ((???)).

A key limitation of pathway-based scoring is the heterogeneity within gene sets, where different genes may have opposing effects on biological processes. Direct scoring of such broad gene sets can obscure meaningful patterns. To address this, pathMED incorporates methods to refine gene sets by clustering features with coordinated activity, improving pathway granularity and revealing previously hidden subpathways.

Many ML-based clinical prediction models also suffer from overfitting, unreliable performance metrics, and limited reproducibility in external datasets, restricting their clinical applicability. To overcome these challenges, pathMED provides a unified framework for ML model training, validation, and prediction. While it leverages the caret package, pathMED simplifies its usage and implements nested cross-validation by default to enhance model reliability.

2 Installation

You can install pathMED from the GitHub repository:

devtools::install_github("jordimartorell/pathMED")

Please note that on some operating systems such as Ubuntu there may be installation errors in the packages “metrica”, “factoextra” and “FactoMineR”, as it is necessary to install system dependencies with administrator permissions. These system dependencies can be identified the following code (adapted to Ubuntu 20.04) and installed by running the output in a terminal as administrator:

if (!require("pak")) {
    install.packages("pak"))
}
pak::pkg_sysreqs(
    c("metrica", "factoextra", "FactoMineR"),
    "ubuntu", "20.04"
)

3 Core pipeline

3.1 (Optional) Decomposing gene sets into coexpressed subsets

Gene sets often contain genes associated with a specific biological function, but their expression can be heterogeneous, sometimes even exhibiting opposing directions when the function is altered (e.g., in disease). The dissectDB() function clusters the genes within each set into subsets based on coordinated expression patterns. This clustering can be performed using one or more molecular datasets, which may include control samples or not. The same dataset used in the following sections can also be used for this purpose. Both preloaded and custom gene sets can be used (see the next section for details). The following code splits KEGG pathways into subpathways using the example dataset:

library(pathMED)
data(pathMEDExampleData)

customKEGG <- dissectDB(list(pathMEDExampleData), geneSets = "kegg")

The output is a gene set that can be used with the getScores() function. In the new gene set, subpathways are labeled with the suffix _splitx. For example, for the pathway hsa04714:

# Before splitting
data(genesetsData)
print(head(genesetsData[["kegg"]][["hsa04714"]]))
## [1] "COX6A2" "COX7A1" "COX15"  "GRB2"   "HRAS"   "MAP3K5"
# After splitting
print(customKEGG[grep("hsa04714", names(customKEGG))])
## $hsa04714.split1
## [1] "NDUFA4"  "MGLL"    "NDUFAF3" "NDUFB2" 
## 
## $hsa04714.split2
## [1] "NDUFS4"  "UQCRQ"   "NDUFA11" "SMARCD1" "NDUFA6"  "NDUFA3"  "COX16"  
## 
## $hsa04714.split3
## [1] "COX20"  "RPS6"   "NDUFB6" "NDUFC2" "UQCRC2" "ACTL6A" "UQCRH" 
## 
## $hsa04714.split4
## [1] "MLST8"  "NDUFA2" "NDUFV1" "NDUFS8" "CPT1B"  "NDUFS7"

Only the features available in the input dataset are included in this object.

3.2 Scoring

The getScores() function calculates a score for each gene set and sample. It takes as input a matrix or data frame with molecular features in rows and samples in columns, along with a gene set collection. The function supports 20 different scoring methods, which can be specified using the method parameter. The available methods and their references are:

  • M-score ((???))
  • GSVA ((???))
  • ssGSEA ((???))
  • singscore ((???))
  • Plage ((???))
  • Z-score ((???))
  • AUCell ((???))
  • decoupleR methods (MDT, MLM, ORA, UDT, ULM, WMEAN, norm_WMEAN, corr_WMEAN, WSUM, norm_WSUM and corr_WSUM) ((???))
  • FGSEA and norm_FGSEA ((???))

Gene sets can be provided as a named list, where each entry contains the genes associated with a specific gene set, using the geneSets parameter. Alternatively, preloaded databases can be used by specifying their name in the geneSets parameter.

  • Reactome (“reactome”)
  • KEGG (“kegg”)
  • Transcriptional modules associated with immune functions (“tmod”)
  • Gene Ontology BP, MF and CC (“go_bp”, “go_mf”, “go_cc”)
  • Disgenet (“disgenet”)
  • Human Phenotype Ontology (“hpo”)
  • WikiPathways (“wikipathways”)
  • PharmGKB (“pharmgkb”)
  • LINCS (“lincs”)
  • Comparative Toxicogenomics Database (“ctd”)

Note that these preloaded gene sets are built with gene symbols. Therefore, the row names of the input data should be gene symbols to use them.

If running the script on a multi-core processor, you can use the cores parameter to specify the number of workers (most pathMED functions include this parameter). Additionally, specific parameters for the scoring functions can be customized (see, for example, ?GSVA::gsvaParam).

To run this function with the example data:

library(pathMED)
data(pathMEDExampleData)
scoresExample <- getScores(pathMEDExampleData, geneSets = "kegg", 
                            method = "Z-score")
print(scoresExample[1:5, 1:5])
##          2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## hsa00010   0.6615978   0.3752335   -0.8888598 1.87367072    1.1322258
## hsa00020  -3.0246505  -2.4858390    0.9356097 0.75894566    1.3966702
## hsa00030  -2.4434677   1.3570865   -2.6715807 0.06263129   -0.2135047
## hsa00040  -0.3852843  -0.4486775   -0.4217614 0.51468743    0.6125073
## hsa00051  -0.3852843  -0.4486775   -0.4217614 0.51468743    0.6125073

The result is a matrix with gene set identifiers as rows and samples as columns. To annotate the identifiers with their corresponding descriptions, you can use the ann2term() function.

annotatedPathways <- ann2term(scoresExample)
head(annotatedPathways)
##         ID                                     term
## 1 hsa00010             Glycolysis / Gluconeogenesis
## 2 hsa00020                Citrate cycle (TCA cycle)
## 3 hsa00030                Pentose phosphate pathway
## 4 hsa00040 Pentose and glucuronate interconversions
## 5 hsa00051          Fructose and mannose metabolism
## 6 hsa00052                     Galactose metabolism

3.3 Training ML models

The next step in the core pathMED pipeline is to use the scores calculated in the previous step to train machine learning (ML) models for predicting categorical or numerical clinical variables. The trainModel() function fits and tests multiple algorithms, selecting the one with the best performance for predicting a variable. The input objects are inputData (the scores matrix obtained in the previous step, or any other numerical matrix), and metadata (a data frame containing sample metadata). The var2predict parameter indicates the column name of metadata with the variable to be predicted.

The function methodsML() should be used to prepare a list of ML algorithms to be trained and tested. One or more algorithms can be selected using the algorithms parameter (see the function help for details), and the amount of hyperparameters combinations to be tested is specified with the tuneLength parameter.

Nested cross-validation is performed using Koutter folds for the outer cross-validation (CV) and Kinner folds for the inner CV. The inner k-fold CV can be repeated multiple times (repeatsCV parameter) for hyperparameter tuning.

In the example dataset, the metadata include a column called Response, indicating whether patients responded to a treatment. The positiveClass parameter can be specified to calculate performance metrics based on the chosen positive class (e.g., diseased).

data(pathMEDExampleMetadata)

modelsList <- methodsML(algorithms = c("rf", "knn"), outcomeClass = "character")

set.seed(123)
trainedModel <- trainModel(scoresExample,
    metadata = pathMEDExampleMetadata,
    var2predict = "Response",
    positiveClass = "YES",
    models = modelsList,
    Koutter = 2,
    Kinner = 2,
    repeatsCV = 1
)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================================| 100%
print(trainedModel)
## $model
## Random Forest 
## 
##  40 samples
## 310 predictors
##   2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: None 
## 
## $stats
##                         rf       knn
## mcc              0.3115628 0.1483202
## balacc           0.6446970 0.5696970
## accuracy         0.6466165 0.5751880
## recall           0.4393939 0.4393939
## specificity      0.8500000 0.7000000
## npv              0.6071429 0.5584416
## precision        0.7285714 0.6000000
## fscore           0.5476190 0.5000000
## perc.lossSamples 0.0000000 0.0000000
## 
## $bestTune
##   .mtry
## 1   2.5
## 
## $subsample.preds
##                 NO   YES obs
## bd9262b7_91  0.472 0.528  NO
## 564c1b1a_143 0.524 0.476  NO
## 55d3e499_0   0.574 0.426  NO
## 564c1b1a_124 0.572 0.428  NO
## 55d3e499_52  0.554 0.446  NO
## 2f13c769_0   0.526 0.474  NO
## 96bf8031_0   0.650 0.350  NO
## 3a42e820_170 0.582 0.418  NO
## 3a42e820_79  0.654 0.346  NO
## 2270bbfa_91  0.702 0.298  NO
## b60c5f9a_430 0.668 0.332 YES
## 3daec0f2_219 0.476 0.524 YES
## 3daec0f2_295 0.320 0.680 YES
## dc81f201_0   0.562 0.438 YES
## a1b702ca_382 0.376 0.624 YES
## 616d027e_0   0.486 0.514 YES
## 3daec0f2_384 0.344 0.656 YES
## b60c5f9a_625 0.562 0.438 YES
## dc81f201_157 0.466 0.534 YES
## 601fe1df_4   0.536 0.464 YES
## b60c5f9a_535 0.610 0.390 YES
## 2f13c769_68  0.524 0.476  NO
## 2f13c769_109 0.398 0.602  NO
## 564c1b1a_52  0.596 0.404  NO
## 3a42e820_435 0.550 0.450  NO
## 49eb5d3c_163 0.656 0.344  NO
## 96bf8031_96  0.680 0.320  NO
## bd9262b7_0   0.586 0.414  NO
## 3a42e820_560 0.494 0.506  NO
## 86a2a8eb_0   0.500 0.500  NO
## 2270bbfa_0   0.748 0.252  NO
## 3daec0f2_126 0.386 0.614 YES
## 3daec0f2_35  0.446 0.554 YES
## a1b702ca_284 0.432 0.568 YES
## 616d027e_672 0.574 0.426 YES
## a1b702ca_27  0.596 0.404 YES
## 601fe1df_0   0.630 0.370 YES
## dc81f201_38  0.524 0.476 YES
## b60c5f9a_89  0.540 0.460 YES
## a1b702ca_0   0.556 0.444 YES

The output object is a list with the following elements:

  • model: The selected algorithm trained with the entire dataset.
  • stats: Performance metrics for each tested algorithms. These metrics are calculated with the testing samples of the outter CV.
  • bestTune: Optimized hyperparameters for the selected algorithm.
  • subsample.preds: The prediction probability for each class and sample.

This object may be used with the predictExternal() function to make predictions on new data with the same features than the training data.

data(refData)

scoresExternal <- getScores(refData$dataset1,
    geneSets = "kegg",
    method = "Z-score"
)

predictions <- predictExternal(scoresExternal, trainedModel)
head(predictions)
##          Obs Prediction
## 1 GSM1594346        YES
## 2 GSM1594449         NO
## 3 GSM1594572        YES
## 4 GSM1594656         NO
## 5 GSM1594994        YES
## 6 GSM1594605         NO

4 Using M-scores to extract disease-relevant information

In addition to previously developed molecular scoring methods, pathMED includes the novel M-score method. M-scores were used in a previous study that used personalized scoring to predict drug response, symptoms and disease progression in Systemic Lupus Erythematosus patients based on gene expression data (???).

M-scores are calculated by comparing pathway activities between cases and healthy controls, requiring the availability of control sample data. To use M-scores with the getScores() function, the labels parameter must be used to indicate control samples (using 0 or “Healthy”).

As described previously (???), M-scores can be used to identify key pathways associated with the studied disease. To achieve this, a reference dataset of patients and healthy controls must be constructed, either from a single dataset or by integrating multiple datasets, following the steps outlined below. This reference will also enable the imputation of M-scores in new cohorts or individual patients where healthy controls are not available.

4.1 Build the reference data object

We will load refData into the environment, which includes four log2-normalized gene expression datasets and four corresponding metadata datasets containing clinical information. These datasets were extracted from NCBI GEO and contain both lupus and healthy samples, encoded in the metadata as “Healthy_sample” and “Disease_sample”.

The buildRefObject() function can be used to create a refObject object with the required structure for the following steps. The function require:

  • A list of expression datasets (data) or a single dataset if only one is available.
  • A list of metadata files (one per dataset) or a single metadata table containing information for all samples.
  • The name of the grouping column in the metadata (groupVar).
  • The control group label (controlGroup), typically representing healthy samples. If different datasets use different control labels, a separate value should be provided for each metadata file.
data(refData)

refObject <- buildRefObject(
    data = list(
        refData$dataset1, refData$dataset2,
        refData$dataset3, refData$dataset4
    ),
    metadata = list(
        refData$metadata1, refData$metadata2,
        refData$metadata3, refData$metadata4
    ),
    groupVar = "group",
    controlGroup = "Healthy_sample"
)

To create the refObject object, it is recommended to include as many datasets as possible to improve robustness. Ensure that all gene expression datasets are log2 transformed and follow a normal distribution. Additionally, gene names must be annotated as Gene Symbols in both reference and test datasets.

4.2 Calculate the reference M-scores

After creating a refObject object, the next step is to calculate M-scores for the reference datasets with the mScores_createReference() function.

refMscores <- mScores_createReference(refObject,
    geneSets = "tmod",
    cores = 1
)

4.3 Identify key gene sets

To identify the most relevant gene sets for the studied phenotype, you can use the mScores_filterPaths() function. This function selects pathways based on two key parameters: perc_samples, which defines the percentage of samples in which a pathway must be statistically significant, and min_datasets, which sets the minimum number of datasets that must meet this th

These parameters help identify pathways that may be significant in small patient subgroups—a common scenario in heterogeneous diseases—while also ensuring that the selected pathways appear deregulated across multiple studies, reducing the risk of study-specific biases. The significance of each pathway in each sample is determined using its M-score, as described in the original study (???).

relevantPaths <- mScores_filterPaths(
    MRef = refMscores,
    min_datasets = 3,
    perc_samples = 10
)
## Selected paths: 168

4.4 Calculate the M-scores without healthy controls

Once the reference data is prepared and the relevant gene sets are selected, you can calculate M-scores for datasets without healthy controls using the mScores_imputeFromReference() function.

The function requires a matrix containing case samples, gene sets (or the output of the mScores_filterPaths() function), and the reference data generated by mScores_createReference(). M-scores are computed using the k most similar samples from the reference data, with k specified through the nk parameter. The distance.threshold parameter defines the maximum allowable Euclidean distance between a sample and the reference for M-score imputation.

The output is a list containing a matrix of M-scores, where gene sets are in rows and samples in columns, along with a dataframe of calculated distances between each sample and the reference.

mScoresExample <- mScores_imputeFromReference(
    inputData = pathMEDExampleData,
    geneSets = relevantPaths,
    externalReference = refMscores,
    distance.threshold = 50
)
## Calculating M-Scores by similarity with samples from external reference for 40 patients
print(mScoresExample$Mscores[1:5, 1:5])
##         2f13c769_68 bd9262b7_91 564c1b1a_143 55d3e499_0 2f13c769_109
## LI.M0   4.667796657   0.3831264   -0.7573253 -0.1096919   -1.1079187
## LI.M2.1 0.008238995  -0.2117298   -0.3545561 -0.3799828   -0.3549857
## LI.M4.1 1.559523495   1.3094181    1.2007531  1.2003028    1.6567628
## LI.M4.2 1.722801405   1.2468369    1.1522729  1.1678329    1.5024763
## LI.M4.4 0.496440873   0.5941196    0.8777018  0.9477592    1.2017842
print(mScoresExample$Distances[1:5, ])
##             ID Distance
## 1  2f13c769_68 47.60733
## 2  bd9262b7_91 47.62470
## 3 564c1b1a_143 47.61592
## 4   55d3e499_0 47.69350
## 5 2f13c769_109 47.51437

5 Session info

## R Under development (unstable) (2025-03-13 r87965)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] caret_7.0-1      lattice_0.22-7   ggplot2_3.5.1    pathMED_0.99.2  
## [5] BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_2.0.0              magrittr_2.0.3             
##   [3] magick_2.8.6                TH.data_1.1-3              
##   [5] estimability_1.5.1          farver_2.1.2               
##   [7] rmarkdown_2.29              minerva_1.5.10             
##   [9] vctrs_0.6.5                 memoise_2.0.1              
##  [11] tinytex_0.56                energy_1.7-12              
##  [13] htmltools_0.5.8.1           S4Arrays_1.7.3             
##  [15] xgboost_1.7.9.1             Rhdf5lib_1.29.2            
##  [17] rhdf5_2.51.2                SparseArray_1.7.7          
##  [19] pROC_1.18.5                 sass_0.4.9                 
##  [21] parallelly_1.43.0           bslib_0.9.0                
##  [23] htmlwidgets_1.6.4           plyr_1.8.9                 
##  [25] sandwich_3.1-1              emmeans_1.11.0             
##  [27] zoo_1.8-13                  lubridate_1.9.4            
##  [29] cachem_1.1.0                gam_1.22-5                 
##  [31] lifecycle_1.0.4             iterators_1.0.14           
##  [33] pkgconfig_2.0.3             rsvd_1.0.5                 
##  [35] Matrix_1.7-3                R6_2.6.1                   
##  [37] fastmap_1.2.0               GenomeInfoDbData_1.2.14    
##  [39] MatrixGenerics_1.19.1       future_1.34.0              
##  [41] digest_0.6.37               colorspace_2.1-1           
##  [43] patchwork_1.3.0             AnnotationDbi_1.69.1       
##  [45] S4Vectors_0.45.4            irlba_2.3.5.1              
##  [47] GenomicRanges_1.59.1        RSQLite_2.3.9              
##  [49] beachmat_2.23.7             randomForest_4.7-1.2       
##  [51] timechange_0.3.0            httr_1.4.7                 
##  [53] abind_1.4-8                 compiler_4.6.0             
##  [55] proxy_0.4-27                gsl_2.1-8                  
##  [57] bit64_4.6.0-1               withr_3.0.2                
##  [59] BiocParallel_1.41.5         metrica_2.1.0              
##  [61] DBI_1.2.3                   HDF5Array_1.35.16          
##  [63] MASS_7.3-65                 lava_1.8.1                 
##  [65] DelayedArray_0.33.6         rjson_0.2.23               
##  [67] scatterplot3d_0.3-44        flashClust_1.01-2          
##  [69] ModelMetrics_1.2.2.2        tools_4.6.0                
##  [71] future.apply_1.11.3         FactoMineR_2.11            
##  [73] nnet_7.3-20                 glue_1.8.0                 
##  [75] h5mread_0.99.4              rhdf5filters_1.19.2        
##  [77] nlme_3.1-168                grid_4.6.0                 
##  [79] cluster_2.1.8.1             reshape2_1.4.4             
##  [81] generics_0.1.3              recipes_1.2.1              
##  [83] gtable_0.3.6                class_7.3-23               
##  [85] tidyr_1.3.1                 data.table_1.17.0          
##  [87] BiocSingular_1.23.0         ScaledMatrix_1.15.0        
##  [89] XVector_0.47.2              BiocGenerics_0.53.6        
##  [91] ggrepel_0.9.6               foreach_1.5.2              
##  [93] pillar_1.10.2               stringr_1.5.1              
##  [95] GSVA_2.1.12                 splines_4.6.0              
##  [97] dplyr_1.1.4                 survival_3.8-3             
##  [99] bit_4.6.0                   caretEnsemble_4.0.1        
## [101] annotate_1.85.0             tidyselect_1.2.1           
## [103] SingleCellExperiment_1.29.2 pbapply_1.7-2              
## [105] Biostrings_2.75.4           knitr_1.50                 
## [107] bookdown_0.42               IRanges_2.41.3             
## [109] SummarizedExperiment_1.37.0 stats4_4.6.0               
## [111] xfun_0.52                   Biobase_2.67.0             
## [113] hardhat_1.4.1               factoextra_1.0.7           
## [115] timeDate_4041.110           matrixStats_1.5.0          
## [117] DT_0.33                     stringi_1.8.7              
## [119] UCSC.utils_1.3.1            boot_1.3-31                
## [121] yaml_2.3.10                 evaluate_1.0.3             
## [123] codetools_0.2-20            tibble_3.2.1               
## [125] graph_1.85.3                BiocManager_1.30.25        
## [127] multcompView_0.1-10         cli_3.6.4                  
## [129] rpart_4.1.24                xtable_1.8-4               
## [131] munsell_0.5.1               jquerylib_0.1.4            
## [133] Rcpp_1.0.14                 GenomeInfoDb_1.43.4        
## [135] globals_0.16.3              coda_0.19-4.1              
## [137] png_0.1-8                   XML_3.99-0.18              
## [139] parallel_4.6.0              leaps_3.2                  
## [141] gower_1.0.2                 blob_1.2.4                 
## [143] sparseMatrixStats_1.19.0    SpatialExperiment_1.17.0   
## [145] listenv_0.9.1               mvtnorm_1.3-3              
## [147] GSEABase_1.69.1             ipred_0.9-15               
## [149] e1071_1.7-16                scales_1.3.0               
## [151] prodlim_2024.06.25          purrr_1.0.4                
## [153] crayon_1.5.3                ada_2.0-5                  
## [155] rlang_1.1.5                 KEGGREST_1.47.1            
## [157] multcomp_1.4-28

6 References