Omada: an unsupervised machine learning toolkit for automated sample clustering of gene expression profiles

Sokratis Kariotis

University of Sheffield, Agency for Science, Technology and Research (A*STAR)

Abstract

Symptomatic heterogeneity in complex diseases reveals differences in molecular states that need to be investigated. However, selecting the numerous parameters of an exploratory clustering analysis in RNA profiling studies requires a deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent, and further gene association analyses need to be performed independently. We have developed a suite of tools to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated, machine-learning-based functions. The efficiency of each tool was tested with four datasets characterised by different expression signal strengths. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Even in datasets with less clear biological distinctions, stable subgroups with different expression profiles and clinical associations were found.

Loading the library

Load the library to access the functions and the two toy datasets: gene expressions and cluster memberships. The session information recorded for this document is listed below.

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] omada_1.0.0       dplyr_1.0.10      glmnet_4.1-4      Matrix_1.5-1     
##  [5] clValid_0.7       cluster_2.1.4     clusterCrit_1.2.8 reshape_0.8.9    
##  [9] ggplot2_3.3.6     diceR_1.2.2       Rcpp_1.0.9        fpc_2.2-9        
## [13] kernlab_0.9-31    pdfCluster_1.0-3 
## 
## loaded via a namespace (and not attached):
##  [1] mclust_6.0.0      lattice_0.20-45   class_7.3-20      assertthat_0.2.1 
##  [5] digest_0.6.30     foreach_1.5.2     utf8_1.2.2        R6_2.5.1         
##  [9] plyr_1.8.7        magic_1.6-0       stats4_4.2.1      evaluate_0.17    
## [13] pillar_1.8.1      rlang_1.0.6       diptest_0.76-0    jquerylib_0.1.4  
## [17] rmarkdown_2.17    splines_4.2.1     stringr_1.4.1     munsell_0.5.0    
## [21] compiler_4.2.1    xfun_0.34         pkgconfig_2.0.3   shape_1.4.6      
## [25] htmltools_0.5.3   nnet_7.3-18       tidyselect_1.2.0  tibble_3.1.8     
## [29] codetools_0.2-18  fansi_1.0.3       withr_2.5.0       MASS_7.3-58.1    
## [33] grid_4.2.1        jsonlite_1.8.3    gtable_0.3.1      lifecycle_1.0.3  
## [37] DBI_1.1.3         magrittr_2.0.3    scales_1.2.1      cli_3.4.1        
## [41] stringi_1.7.8     cachem_1.0.6      flexmix_2.3-18    robustbase_0.95-0
## [45] geometry_0.4.6.1  bslib_0.4.0       generics_0.1.3    vctrs_0.5.0      
## [49] iterators_1.0.14  tools_4.2.1       glue_1.6.2        DEoptimR_1.0-11  
## [53] purrr_0.3.5       survival_3.4-0    abind_1.4-5       parallel_4.2.1   
## [57] fastmap_1.1.0     yaml_2.3.6        colorspace_2.0-3  prabclus_2.3-2   
## [61] knitr_1.40        sass_0.4.2        modeltools_0.2-23

Investigating the feasibility of a dataset based on its dimensions (number of samples and features)

To investigate the clustering feasibility of a dataset, this package provides two functions that simulate a dataset of specific dimensions and calculate its stability for a range of cluster numbers. feasibilityAnalysis() generates an independent dataset for a specific number of classes, samples and features, while feasibilityAnalysisDataBased() accepts an existing dataset and extracts statistics (means and standard deviations) for a specific number of clusters. Note that these estimates only indicate the dataset’s fitness for the downstream analysis and are not an actual measure of quality: they do not account for the actual signal in the data, only for the relation between the number of samples, features and clusters.
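The idea of a simulation-based stability estimate can be illustrated without the package. The following is a minimal base R sketch, not the implementation of feasibilityAnalysis(): the helper names (simulate_dataset, rand_index, stability_for_k), the class-mean shift of 5, the 80% subsampling scheme and the use of average-linkage hierarchical clustering are all assumptions made for illustration.

```r
# Illustrative sketch in base R; not the code behind feasibilityAnalysis().
set.seed(1)

# Hypothetical helper: Gaussian data whose class means are shifted by an
# arbitrary 5 units per feature, so the classes are clearly separable.
simulate_dataset <- function(classes, samples, features) {
  labels <- rep_len(seq_len(classes), samples)
  values <- rnorm(samples * features, mean = 5 * labels, sd = 1)
  list(data = matrix(values, nrow = samples, ncol = features), labels = labels)
}

# Rand index: proportion of sample pairs on which two partitions agree.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)
  mean(same_a[pairs] == same_b[pairs])
}

# Stability for one k: mean agreement between the partition of the full data
# and partitions of random 80% subsamples (average-linkage clustering).
stability_for_k <- function(data, k, repeats = 10) {
  cluster <- function(x) cutree(hclust(dist(x), method = "average"), k = k)
  full <- cluster(data)
  scores <- vapply(seq_len(repeats), function(i) {
    keep <- sample(nrow(data), size = floor(0.8 * nrow(data)))
    rand_index(full[keep], cluster(data[keep, , drop = FALSE]))
  }, numeric(1))
  mean(scores)
}

sim <- simulate_dataset(classes = 3, samples = 60, features = 20)
ks <- 2:5
stabilities <- vapply(ks, function(k) stability_for_k(sim$data, k), numeric(1))
names(stabilities) <- ks
print(round(stabilities, 2))
```

Because the simulated classes are strongly separated, the partition at k = 3 should be reproduced on every subsample. Smaller shifts, fewer samples or more features would be expected to lower the scores, which is the kind of relation these estimates are meant to expose.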

Automated clustering analysis: Omada

Using omada() with a gene expression data frame and an upper k (the maximum number of clusters to be considered), we can run the whole analysis toolkit to automate clustering decision making and produce the estimated optimal clusters. NA values must be removed or imputed before running any of the tools.

Selecting the most appropriate clustering approach based on a dataset

To select the most appropriate clustering technique for our dataset, we compare the internal partition agreement of three different approaches (spectral, k-means and hierarchical clustering) using the clusteringMethodSelection() function. We define the upper k to be considered as well as the number of internal comparisons per approach. A larger number of comparisons increases robustness at the cost of longer run times.

The suite also provides the function partitionAgreement(), which calculates the partition agreement between two specific clustering approaches and parameter sets. It requires the two algorithms, their measures and the number of clusters.
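To make the notion of partition agreement concrete, here is a small base R sketch that compares a k-means partition with a hierarchical one using the adjusted Rand index. It is not the code behind partitionAgreement(): the made-up expression matrix, the helper adjusted_rand, the choice of the adjusted Rand index as the agreement score, and the settings algorithm = "Lloyd" and method = "average" (standing in for the "measures") are assumptions for illustration.

```r
# Illustrative sketch in base R; not the code behind partitionAgreement().
set.seed(2)

# Made-up expression matrix: 40 samples x 10 genes in two shifted groups.
expr <- rbind(matrix(rnorm(200, mean = 0), nrow = 20),
              matrix(rnorm(200, mean = 5), nrow = 20))

# Adjusted Rand index between two label vectors (1 = identical partitions,
# values near 0 = the agreement expected by chance).
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(x) sum(choose(x, 2))
  expected <- pairs(rowSums(tab)) * pairs(colSums(tab)) / choose(sum(tab), 2)
  maximum <- (pairs(rowSums(tab)) + pairs(colSums(tab))) / 2
  if (maximum == expected) return(1)  # e.g. both partitions have one cluster
  (pairs(tab) - expected) / (maximum - expected)
}

k <- 2
km_labels <- kmeans(expr, centers = k, algorithm = "Lloyd", nstart = 10)$cluster
hc_labels <- cutree(hclust(dist(expr), method = "average"), k = k)

# The index ignores label names, so "1/2" versus "2/1" still counts as agreement.
agreement <- adjusted_rand(km_labels, hc_labels)
print(agreement)
```

Repeating such a comparison over several values of k and several parameter pairs, and averaging the scores per approach, is one plausible way to rank clustering approaches by their internal agreement.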

Selecting the most appropriate features

To select the features that provide the most stable clusters, the function featureSelection() requires the minimum and maximum number of clusters (k) and the feature step, which dictates by how many features each successive feature set grows. It is advised to use the algorithm selected by the previous tools.
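The role of the feature step can be sketched in base R as follows. This is not the code behind featureSelection(): ranking features by variance, the helper stability, the 80% subsampling scheme, average-linkage clustering and the single fixed k = 2 (instead of a range of k) are simplifying assumptions, and the data are made up.

```r
# Illustrative sketch in base R; not the code behind featureSelection().
set.seed(3)

# Made-up data: 40 samples, 12 genes; only gene1..gene4 separate the two groups.
group <- rep(1:2, each = 20)
expr <- cbind(matrix(rnorm(40 * 4, mean = 4 * group), nrow = 40),
              matrix(rnorm(40 * 8), nrow = 40))
colnames(expr) <- paste0("gene", 1:12)

# Assumed ranking criterion: variance, highest first.
ranked <- names(sort(apply(expr, 2, var), decreasing = TRUE))

# Stability of a k-cluster partition under random 80% subsampling, measured
# as the mean share of sample pairs that are grouped consistently.
stability <- function(x, k, repeats = 10) {
  cluster <- function(m) cutree(hclust(dist(m), method = "average"), k = k)
  full <- cluster(x)
  mean(vapply(seq_len(repeats), function(i) {
    keep <- sample(nrow(x), size = floor(0.8 * nrow(x)))
    sub <- cluster(x[keep, , drop = FALSE])
    pairs <- upper.tri(diag(length(keep)))
    mean(outer(full[keep], full[keep], "==")[pairs] == outer(sub, sub, "==")[pairs])
  }, numeric(1)))
}

# Grow the feature set by `step` features at a time and score each set.
step <- 4
sizes <- seq(step, ncol(expr), by = step)  # 4, 8 and 12 features
scores <- vapply(sizes, function(n) {
  stability(expr[, ranked[seq_len(n)], drop = FALSE], k = 2)
}, numeric(1))
names(scores) <- sizes

# Keep the smallest feature set that reaches the highest stability.
best_features <- ranked[seq_len(sizes[which.max(scores)])]
print(scores)
print(best_features)
```

A smaller step evaluates more candidate feature sets and gives a finer answer, at the cost of proportionally more clustering runs.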

Estimating the most appropriate number of clusters

To estimate the most appropriate number of clusters based on an ensemble of internal metrics, the function clusterVoting() accepts the minimum and maximum number of clusters to be considered as well as the algorithm of choice (“sc” for spectral, “km” for k-means and “hr” for hierarchical clustering). It is advised to use the feature set and algorithm selected by the previous tools.
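Voting over internal metrics can be sketched in base R with three commonly used indices. This is not the code behind clusterVoting(), and the set of metrics used by the package is not reproduced here: the three indices (Calinski-Harabasz, mean silhouette width, Dunn), the helper names, the made-up data and the use of average-linkage hierarchical clustering are assumptions for illustration.

```r
# Illustrative sketch in base R; not the code behind clusterVoting().
set.seed(4)

# Made-up data: three well-separated groups of 15 samples over 5 features.
expr <- do.call(rbind, lapply(c(0, 8, 16), function(m) {
  matrix(rnorm(15 * 5, mean = m), nrow = 15)
}))
d <- as.matrix(dist(expr))

# Calinski-Harabasz: between- over within-cluster dispersion (higher is better).
calinski_harabasz <- function(x, cl) {
  ss <- function(m) sum(scale(m, scale = FALSE)^2)
  within <- sum(vapply(split(seq_len(nrow(x)), cl),
                       function(i) ss(x[i, , drop = FALSE]), numeric(1)))
  k <- length(unique(cl))
  ((ss(x) - within) / (k - 1)) / (within / (nrow(x) - k))
}

# Mean silhouette width (higher is better); singletons score 0 by convention.
mean_silhouette <- function(d, cl) {
  mean(vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)
    a <- sum(d[i, own]) / (sum(own) - 1)
    b <- min(vapply(setdiff(unique(cl), cl[i]),
                    function(g) mean(d[i, cl == g]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1)))
}

# Dunn index: smallest between-cluster distance over largest within-cluster
# distance (higher is better).
dunn_index <- function(d, cl) {
  same <- outer(cl, cl, "==")
  min(d[!same]) / max(d[same])
}

ks <- 2:5
tree <- hclust(dist(expr), method = "average")
scores <- t(vapply(ks, function(k) {
  cl <- cutree(tree, k = k)
  c(calinski_harabasz(expr, cl), mean_silhouette(d, cl), dunn_index(d, cl))
}, numeric(3)))
dimnames(scores) <- list(ks, c("calinski_harabasz", "silhouette", "dunn"))

# Each metric votes for the k at which it is maximal; the most voted k wins.
votes <- ks[apply(scores, 2, which.max)]
best_k <- as.integer(names(which.max(table(votes))))
print(scores)
print(best_k)
```

All three indices here are of the "higher is better" kind. An ensemble that also includes "lower is better" indices would need to take the minimum for those before counting votes.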

Running the optimal clustering

The previous steps have provided every clustering parameter needed to perform the partitioning with optimalClustering(). This tool uses the dataset restricted to the most stable feature set, the estimated number of clusters (k) and the selected algorithm. It additionally runs through the possible algorithm parameters and retains the one with the highest stability.

Meta analysis of feature/gene signatures

The first meta analysis of the discovered clusters is the generation of feature (or gene) signatures associated with each cluster. To acquire the signatures, the function geneSignatures() provides the LASSO regression coefficients of each gene as well as a plot of the highest 30% of coefficients per cluster. The previously generated cluster memberships and the initial dataset are required.
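How LASSO coefficients can yield per-cluster signatures is sketched below in base R. This is not the code behind geneSignatures(), and it swaps in a simpler technique: a one-vs-rest linear LASSO fitted by cyclic coordinate descent, written out by hand so the example has no dependencies. In practice a dedicated package such as glmnet, which appears in the session listing, would be the natural choice. The helper lasso, the penalty lambda = 0.05, the made-up data and the rule of dropping zero coefficients from the top 30% are assumptions for illustration.

```r
# Illustrative sketch in base R; not the code behind geneSignatures().
set.seed(5)

# Made-up data: 60 samples in 3 clusters, 10 genes. Cluster g is marked by
# raised expression of genes 2g-1 and 2g; genes 7-10 are pure noise.
memberships <- rep(1:3, each = 20)
expr <- matrix(rnorm(60 * 10), nrow = 60,
               dimnames = list(NULL, paste0("gene", 1:10)))
for (g in 1:3) {
  markers <- c(2 * g - 1, 2 * g)
  expr[memberships == g, markers] <- expr[memberships == g, markers] + 5
}

# LASSO by cyclic coordinate descent on standardised features, minimising
# (1 / 2n) * ||y - X b||^2 + lambda * ||b||_1.
lasso <- function(x, y, lambda, sweeps = 200) {
  x <- scale(x)
  y <- y - mean(y)
  n <- nrow(x)
  beta <- setNames(numeric(ncol(x)), colnames(x))
  for (sweep in seq_len(sweeps)) {
    for (j in seq_along(beta)) {
      # Residual with feature j left out, then a soft-thresholded update.
      residual <- y - x[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(x[, j] * residual) / n
      beta[j] <- sign(rho) * max(abs(rho) - lambda, 0) / (sum(x[, j]^2) / n)
    }
  }
  beta
}

# One-vs-rest: regress each cluster indicator on all genes, then keep the
# top 30% of genes by absolute coefficient, dropping exact zeros.
clusters <- sort(unique(memberships))
signatures <- lapply(clusters, function(g) {
  beta <- lasso(expr, as.numeric(memberships == g), lambda = 0.05)
  top <- ceiling(0.3 * length(beta))
  kept <- beta[order(abs(beta), decreasing = TRUE)][seq_len(top)]
  kept[kept != 0]
})
names(signatures) <- clusters
print(lapply(signatures, names))
```

The L1 penalty sets the coefficients of uninformative genes to exactly zero, which is what makes the surviving genes usable as a compact signature for each cluster.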