Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. One critical unmet challenge is that molecular disease subtypes characterized by relevant clinical differences, such as survival, are difficult to differentiate. With the advancement of multi-omics technologies, subtyping methods have shifted toward data integration in order to differentiate among subtypes from a holistic perspective that takes into consideration phenomena at multiple levels. However, these integrative methods are still limited by their statistical assumptions and their sensitivity to noise. In addition, they are unable to predict the risk scores of patients using multi-omics data.
To address this problem, we introduce Subtyping via Consensus Factor Analysis (SCFA), a novel method for cancer subtyping and risk prediction using consensus factor analysis. SCFA follows a three-stage hierarchical process to ensure the robustness of the discovered subtypes. First, the method uses an autoencoder to filter out genes with an insignificant contribution in characterizing each patient. Second, it applies a modified factor analysis to generate a collection of factor representations of the high-dimensional multi-omics data. Finally, it utilizes a consensus ensemble to find subtypes that are shared across all factor representations.
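The consensus idea in the final stage can be illustrated with a small base-R sketch. This is not the SCFA implementation: PCA stands in for the modified factor analysis, the autoencoder filtering step is skipped entirely, and the simulated data, the `n_components` values, and the choice of k-means plus average-linkage clustering are all assumptions made only for illustration.

```r
# Toy illustration of consensus clustering over several low-dimensional
# representations. PCA stands in for factor analysis here; this is NOT
# the SCFA implementation.
set.seed(1)

# Simulate 60 samples x 50 features with three well-separated groups
truth <- rep(1:3, each = 20)
x <- matrix(rnorm(60 * 50), nrow = 60)
x[truth == 2, 1:10]  <- x[truth == 2, 1:10] + 4
x[truth == 3, 11:20] <- x[truth == 3, 11:20] + 4

# Build several representations by keeping a different number of components
pca <- prcomp(x, center = TRUE, scale. = FALSE)
n_components <- c(2, 3, 5, 8)  # hypothetical choices

# Cluster each representation and accumulate a co-association matrix:
# entry (i, j) is the fraction of runs in which samples i and j co-cluster
coassoc <- matrix(0, nrow(x), nrow(x))
for (k in n_components) {
  cl <- kmeans(pca$x[, 1:k, drop = FALSE], centers = 3, nstart = 20)$cluster
  coassoc <- coassoc + outer(cl, cl, "==")
}
coassoc <- coassoc / length(n_components)

# Consensus step: hierarchical clustering on 1 - co-association
consensus <- cutree(hclust(as.dist(1 - coassoc), method = "average"), k = 3)

# Compare the consensus labels with the simulated groups
print(table(truth, consensus))
```

Samples that are grouped together in most representations end up close in the co-association matrix, so the final partition reflects structure shared across representations rather than the quirks of any single one.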
To install SCFA, you need to install the R package from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SCFA")
After that, install keras in Python using the keras R package. You may also need to install Anaconda if you do not have it. For more information about installing keras, please visit https://keras.rstudio.com/
if (is(try(reticulate::conda_version()), "try-error"))
    reticulate::install_miniconda(force = TRUE)
keras::install_keras(method = "conda", tensorflow = "1.10.0")
Load the example data GBM. GBM is the glioblastoma cancer dataset.
# Load required libraries
library(SCFA)
library(survival)

# Load example data (GBM dataset). For other datasets, download the rds file from the Data folder at
# https://bioinformatics.cse.unr.edu/software/scfa/Data/ and load the rds object
data("GBM")
# List of one matrix of microRNA data; other examples would have 3 matrices of 3 data types
dataList <- GBM$data
# Survival information
survival <- GBM$survival
We can use the main function SCFA to generate subtypes from multi-omics data. The input of this function is a list of matrices from different data types. Each matrix has samples as rows and features as columns. The output of this function is the subtype assignment for each patient. We can then perform survival analysis to determine whether survival differs significantly between the discovered subtypes.
# Generate the subtyping result
set.seed(1)
subtype <- SCFA(dataList, seed = 1, ncores = 4L)
## The number of clusters before voting is: 4
## The optimal number of clusters for ensemble clustering is: 4
## The number of clusters before voting is: 4
## The optimal number of clusters for ensemble clustering is: 4
# Perform survival analysis on the result
coxFit <- coxph(Surv(time = Survival, event = Death) ~ as.factor(subtype), data = survival, ties = "exact")
# The third element of the score test is its p-value
coxP <- round(summary(coxFit)$sctest[3], digits = 20)
print(coxP)
##     pvalue
## 0.01231359
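To run the subtyping step on your own data, the input needs to be in the list-of-matrices format described above. The following sketch builds a toy list and runs a few sanity checks. The names `toyList`, `expr`, and `meth` and the simulated values are hypothetical, and the check that every matrix lists the same patients in the same row order is an assumption about what an integrative method would need, not a documented requirement of the package.

```r
# Hypothetical toy data: 6 patients, two data types
set.seed(1)
patients <- paste0("patient", 1:6)

expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(patients, paste0("gene", 1:4)))
meth <- matrix(runif(6 * 3), nrow = 6,
               dimnames = list(patients, paste0("cpg", 1:3)))

# Rows are samples, columns are features; one matrix per data type
toyList <- list(expression = expr, methylation = meth)

# Sanity checks before passing a list like this to a subtyping function
stopifnot(
  all(vapply(toyList, is.matrix, logical(1))),
  all(vapply(toyList, is.numeric, logical(1))),
  # assumed requirement: same patients, same order, in every matrix
  all(vapply(toyList, function(m) identical(rownames(m), patients), logical(1)))
)

# Dimensions (samples, features) of each data type
print(vapply(toyList, dim, integer(2)))
```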
We can use the function SCFA.class to predict the risk scores of patients using the survival information available in the training data. We need to provide the function with training data, the corresponding survival information, and testing data. The output is the risk score of each patient in the testing data. Patients with higher risk scores have a higher probability of experiencing the event before patients with lower scores. The concordance index is used to measure the agreement between the predicted risk scores and the observed survival information.
# Split data into training and testing sets
set.seed(1)
idx <- sample.int(nrow(dataList[[1]]), round(nrow(dataList[[1]]) / 2))

# Survival time must be positive
survival$Survival <- survival$Survival - min(survival$Survival) + 1

trainList <- lapply(dataList, function(x) x[idx, ])
trainSurvival <- Surv(time = survival[idx, ]$Survival, event = survival[idx, ]$Death)

testList <- lapply(dataList, function(x) x[-idx, ])
testSurvival <- Surv(time = survival[-idx, ]$Survival, event = survival[-idx, ]$Death)

# Perform risk prediction
result <- SCFA.class(trainList, trainSurvival, testList, seed = 1, ncores = 4L)

# Validation using the concordance index
c.index <- survival::concordance(coxph(testSurvival ~ result))$concordance
print(c.index)
## [1] 0.5778744
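To make the concordance index more concrete, here is a minimal base-R sketch of one common definition (Harrell's C) on hypothetical toy data. The function `concordance_index` and the five-patient example are illustrative only. The sketch does not treat tied event times specially, which packaged implementations such as the one in the survival package handle more carefully, so values may differ slightly on real data.

```r
# Hypothetical toy data for five patients
time  <- c(5, 8, 12, 20, 30)         # follow-up time
event <- c(1, 1, 0, 1, 0)            # 1 = death observed, 0 = censored
risk  <- c(0.9, 0.4, 0.6, 0.5, 0.1)  # predicted risk scores

concordance_index <- function(time, event, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when patient i has an observed event
      # strictly before patient j's follow-up time
      if (event[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1    # earlier event, higher risk
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5  # ties in risk count as half
        }
      }
    }
  }
  concordant / comparable
}

print(concordance_index(time, event, risk))
## [1] 0.75
```

In this example, 8 pairs are comparable and the patient with the earlier event has the higher risk score in 6 of them, giving 6/8 = 0.75. A value of 0.5 corresponds to random ordering, and 1 to perfect agreement between risk scores and survival order.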