Processing math: 30%

Contents

1 Motivation

There are various ways to detect modules from weighted networks. Conventional approaches such as clustering or graph partitioning purely use the network topology to define modules (Fortunato 2010). But we may need additional information such as increasing high-throughput omics data. On the one hand, the construction of reliable networks, especially for specific tissues, is relatively slow. On the other hand, integrating omics data with network topology has become a paradigm in system biology community in the past decade (Mitra et al. 2013). Weighted gene co-expression network (WGCN) is a pure data-driven gene network, which only relies on gene expression profiles. There is no rigorous definition of active modules in WGCN, but the module itself should be more compact and informative compared with random subnetworks. AMOUNTAIN (D. Li et al. 2016) provides a convex optimization based approach to identify such modules. Here we embed parts of the examples from the corresponding package AMOUNTAIN help pages into a single document.

2 Network simulation

We follow (W. Li et al. 2011) to construct gene co-expression networks for simulation study. Let n be the number of genes, and edge weights W as well as node score z follow the uniform distribution in range [0,1]. A module contains k genes inside which the edge weights as well as node score follow the uniform distribution in range [θ,1], where θ={0.5,0.6,0.7,0.8,0.9}.

library(AMOUNTAIN)
n = 100
k = 20
theta = 0.5
pp <- networkSimulation(n, k, theta)
moduleid <- pp[[3]]
netid <- 1:100
restp <- netid[-moduleid]
groupdesign <- list(moduleid,restp)
names(groupdesign) <- c('module','background')

The following figure shows the weighted co-expression network when n=100,k=20 and red nodes indicate module members and wider edges mean larger similarities. Visualization is based on qgraph.

require(qgraph)
## Loading required package: qgraph
pg <- qgraph(pp[[1]],groups=groupdesign,legend=TRUE)

When simulating a two-layer network, the basic method is to connect two independent networks with an inter-layer weight matrix A, which is designed to have larger weights between two modules.

n1 = 100
k1 = 20
theta1 = 0.5
n2 = 80
k2 = 10
theta2 = 0.5
ppresult <- twolayernetworkSimulation(n1,k1,theta1,n2,k2,theta2)
A <- ppresult[[3]]
pp <- ppresult[[1]]
moduleid <- pp[[3]]
netid <- 1:n1
restp <- netid[-moduleid]
pp2 <- ppresult[[2]]
moduleid2 <- pp2[[3]]
netid2 <- 1:n2
restp2 <- netid2[-moduleid2]

library(qgraph)
## labelling the groups
groupdesign <- list(moduleid,restp,(moduleid2+n1),(restp2+n1))
names(groupdesign) <- c('module1','background1','module2',
                     'background2')
twolayernet <- matrix(0,nrow=(n1+n2),ncol=(n1+n2))
twolayernet[1:n1,1:n1] <- pp[[1]]
twolayernet[(n1+1):(n1+n2),(n1+1):(n1+n2)] <- pp2[[1]]
twolayernet[1:n1,(n1+1):(n1+n2)] <- A
twolayernet[(n1+1):(n1+n2),1:n1] <- t(A)

The following figure shows the the two-layer weighted co-expression network based on above simulation.

g <- qgraph(twolayernet,groups=groupdesign,legend=TRUE)

3 Module identification for single layer network

Given the network G with edges weight matrix WRn×n and nodes weight vector zRn, where n is the number of nodes, we formulate the active modules identification on WGCN as a elastic net contrained optmization problem:

min

where the module membership vector {\bf x}\in\mathbb{R}_+^n is relaxed from a 0-1 vector in which x_i\neq0 means node i is in the module. And \alpha is the parameter to balance \ell_1-norm and \ell_2-norm which actually controls the module size. Larger \alpha means a more sparse vector, corresponding smaller module in this case. We adopt the euclidean projection based technique (Gong, Gai, and Zhang 2011) to solve the problem.

Here we show how to use the following function in the package to find an active module for above simulated single layer network. With groundtruth in hand, we can evaluate the quality of identified modules by F-score. In order to get higher quality, we need to tune parameter \alpha in the elastic net penalty and \lambda = 1 in the objective function. The common way to select two optimal parameters is grid search.

n = 100
k = 20
theta = 0.5
pp <- networkSimulation(n,k,theta)
moduleid <- pp[[3]]
alphaset <- seq(0.1,0.9,by=0.1)
lambdaset <- 2^seq(-5,5)
## using a grid search to select lambda and alpha
Fscores <- matrix(0,nrow = length(alphaset),ncol = length(lambdaset))
for (j in 1:length(alphaset)) {
    for (k in 1:length(lambdaset)) {
        x <- moduleIdentificationGPFixSS(pp[[1]],pp[[2]],rep(1/n,n),maxiter = 500,
                                         a=alphaset[j],lambda = lambdaset[k])
        predictedid<-which(x[[2]]!=0)
        recall <- length(intersect(predictedid,moduleid))/length(moduleid)
        precise <- length(intersect(predictedid,moduleid))/length(predictedid)
        Fscores[j,k] <- 2*precise*recall/(precise+recall)
    }
}

We can show gridFscore by 3-D plot to see how these parameters affect the performance. By certain combination of these two parameters, we can almost exactly find the target model nodes with F-score=1.

persp(Fscores,theta = 45,phi = 30,col = "gray",scale = FALSE,xlab = 'alpha',ylab = 'lambda',
      zlab = 'F-score',main = 'Fscores of identified module',box = TRUE)

4 Module identification for two-layer network

The basic idea to identification modules on a two-layer network is to find two active modules on each layer, at the same time with maximal inter-later links. we have function \texttt{moduleIdentificationGPFixSSTwolayer} in the package. Following the two-layer network simulation in section 1, we call the method.

## network simulation is the same as before
modres <- moduleIdentificationGPFixSSTwolayer(pp[[1]],pp[[2]],rep(1/n1,n1),pp2[[1]],pp2[[2]],rep(1/n2,n2),A)
predictedid <- which(modres[[1]]!=0)
recall <- length(intersect(predictedid,moduleid))/length(moduleid)
precise <- length(intersect(predictedid,moduleid))/length(predictedid)
F1 <- 2*precise*recall/(precise+recall)
predictedid2 <- which(modres[[2]]!=0)
recall2 <- length(intersect(predictedid2,moduleid2))/length(moduleid2)
precise2 <- length(intersect(predictedid2,moduleid2))/length(predictedid2)
F2 <- 2*precise2*recall2/(precise2+recall2)

And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2.

5 Module identification for multi-layer network

A general multi-layer network is a natural extension of two-layer networks. Here we consider a specific form of multi-layer network that we could conduct simple operations. The basic idea to identification modules on a two-layer network is to find two active modules on each layer, at the same time with maximal inter-later links. we have function \texttt{moduleIdentificationGPFixSSMultilayer} in the package. Following the multi-layer network simulation, we call the method as:

## network simulation
n = 100
k = 20
L = 5
theta = 0.5
cpl <- multilayernetworkSimulation(n,k,theta,L)
listz <- list()
for (i in 1:L){
listz[[i]] <- cpl[[i+2]]
}
moduleid <- cpl[[2]]
## use default parameters here
x <- moduleIdentificationGPFixSSMultilayer(cpl[[1]],listz,rep(1/n,n))
predictedid <- which(x[[2]]!=0)
recall <- length(intersect(predictedid,moduleid))/length(moduleid)
precise <- length(intersect(predictedid,moduleid))/length(predictedid)
Fscore <- (2*precise*recall/(precise+recall))

And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2.

6 Module identification for real-world data

The usage of the package functions is the same for real-world data, but we need to be aware of two aspects. First of all the way to calculate edges score and nodes score in a weighted network can make an impact on the performance. Different input W and \bf z in the objective function may lead to different modules.

Secondly, we do not have groundtruth about module membership for real world data. In this case, we may need to select the proper parameter so that the desired module size can be archived. When fixing \lambda=0.01, we use a binary search method to select \alpha for the elastic net penalty which controls the sparsity of the module.

## binary search parameter to fix module size to 100~200
abegin = 0.01
aend = 0.9
maxsize = 200
minsize = 100
for (i in 1:100) {
    x <- moduleIdentificationGPFixSS(W,z,rep(1/n,n),a=(abegin+aend)/2,lambda = 0.001,maxiter = 500)
    predictedid <- which(x[[2]]!=0) 
    if(length(predictedid) > maxsize){
        abegin <- (abegin+aend)/2
    }else if (length(predictedid) < minsize){
        aend <- (abegin+aend)/2
    }else
        break
}

7 High-performance considerations

When dealing with large scale networks, pure R is proven to be slow. In the developing version we reimplement the core functions of AMOUNTAIN by C, in which the matrix operations are based on open source GSL. Currently we tested it on Linux platform. Here is a table of C-version functions and pure R functions:

C-version Pure R Brief description
\texttt{CGPFixSS} \texttt{moduleIdentificationGPFixSS} Module identification on single network
\texttt{CGPFixSSTwolayer} \texttt{moduleIdentificationGPFixSSTwolayer} Module identification on two-layer network
\texttt{CGPFixSSMultiLayer} \texttt{moduleIdentificationGPFixSSMultilayer} Module identification on multi-layer network

We found that the most efficient way to call C functions is to compile .c file into shared libraries (For Linux .so and Windows .dll) and to use \texttt{.C(xxx)} in R. Although we could follow the standard way to use Rcpp and even RcppGSL to make use of GSL, the data format transformation makes it slower. For instance, you have to transform an array \texttt{double *} into \texttt{gsl_vector} or even \texttt{RcppGSL::Vector} object by filling each entry. It would cause additional overhead especially for large scale data.

8 Biological explanation

Finally, we can do gene annotation enrichment analysis with interactive tools like DAVID or Enrichr, to see whether a module gene list can be explained by existing biological process, pathways or even diseases.

9 Developer page

Please visit AMOUNTAIN for new features.

10 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] qgraph_1.9.8     AMOUNTAIN_1.33.0 BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        xfun_0.48           bslib_0.8.0        
##  [4] ggplot2_3.5.1       htmlwidgets_1.6.4   psych_2.4.6.26     
##  [7] lattice_0.22-6      quadprog_1.5-8      vctrs_0.6.5        
## [10] tools_4.5.0         generics_0.1.3      stats4_4.5.0       
## [13] parallel_4.5.0      tibble_3.2.1        fansi_1.0.6        
## [16] highr_0.11          cluster_2.1.6       pkgconfig_2.0.3    
## [19] Matrix_1.7-1        data.table_1.16.2   checkmate_2.3.2    
## [22] lifecycle_1.0.4     compiler_4.5.0      stringr_1.5.1      
## [25] tinytex_0.53        munsell_0.5.1       mnormt_2.1.1       
## [28] htmltools_0.5.8.1   glasso_1.11         sass_0.4.9         
## [31] fdrtool_1.2.18      yaml_2.3.10         htmlTable_2.4.3    
## [34] Formula_1.2-5       pillar_1.9.0        jquerylib_0.1.4    
## [37] cachem_1.1.0        magick_2.8.5        Hmisc_5.2-0        
## [40] abind_1.4-8         rpart_4.1.23        nlme_3.1-166       
## [43] lavaan_0.6-19       gtools_3.9.5        tidyselect_1.2.1   
## [46] digest_0.6.37       stringi_1.8.4       dplyr_1.1.4        
## [49] reshape2_1.4.4      bookdown_0.41       fastmap_1.2.0      
## [52] grid_4.5.0          colorspace_2.1-1    cli_3.6.3          
## [55] magrittr_2.0.3      base64enc_0.1-3     utf8_1.2.4         
## [58] pbivnorm_0.6.0      foreign_0.8-87      corpcor_1.6.10     
## [61] scales_1.3.0        backports_1.5.0     rmarkdown_2.28     
## [64] jpeg_0.1-10         igraph_2.1.1        nnet_7.3-19        
## [67] gridExtra_2.3       png_0.1-8           pbapply_1.7-2      
## [70] evaluate_1.0.1      knitr_1.48          rlang_1.1.4        
## [73] Rcpp_1.0.13         glue_1.8.0          BiocManager_1.30.25
## [76] rstudioapi_0.17.1   jsonlite_1.8.9      R6_2.5.1           
## [79] plyr_1.8.9

References

Fortunato, Santo. 2010. “Community Detection in Graphs.” Physics Reports 486 (3): 75–174.
Gong, Pinghua, Kun Gai, and Changshui Zhang. 2011. “Efficient Euclidean Projections via Piecewise Root Finding and Its Application in Gradient Projection.” Neurocomputing 74 (17): 2754–66.
Li, Dong, Shan He, Zhisong Pan, and Guyu Hu. 2016. “Active Modules for Multilayer Weighted Gene Co-Expression Networks: A Continuous Optimization Approach.” bioRxiv. https://doi.org/10.1101/056952.
Li, Wenyuan, Chun-Chi Liu, Tong Zhang, Haifeng Li, Michael S Waterman, and Xianghong Jasmine Zhou. 2011. “Integrative Analysis of Many Weighted Co-Expression Networks Using Tensor Computation.” PLoS Comput Biol 7 (6): e1001106.
Mitra, Koyel, Anne-Ruxandra Carvunis, Sanath Kumar Ramesh, and Trey Ideker. 2013. “Integrative Approaches for Finding Modular Structure in Biological Networks.” Nature Reviews Genetics 14 (10): 719–32.