cogena, a workflow for co-expressed gene-set enrichment analysis

Zhilong Jia, Michael R. Barnes

2018-10-30

1 Abstract

Co-expressed gene-set enrichment analysis, cogena, is a workflow for gene set enrichment analysis of co-expressed genes. The cogena worklow (Figure 1) proceeds from co-expression analysis using a range of clustering methods, through to gene set enrichment analysis based on a range of pre-defined gene sets. Cogena can be applied to a number of different analytical scenarios dependent on the gene set used. Currently cogena is pre-built with gene sets from Msigdb and Connectivity Map, allowing pathway analysis and drug repositioning analysis respectively, the user can also add custom genes sets to expand analytical functionality further.

Overview of the cogena workflow

Overview of the cogena workflow

The following sections outline a typical example of the cogena workflow, describing the input data and typical analysis steps required for pathway analysis and drug repositioning analysis.

2 A quick start

See examples using the ?cogena command in R.

3 Data Input

Note: all the gene names should be gene SYMBOLs since they are used in
the gene set files. Other kinds of gene identifiers can be used
according to the identifiers used in the user-defined gene-set files.

3.1 Input data required

3.2 Example dataset

The cogena package has an example dataset, from the NCBI GEO database GSE13355. The samples are derived from lesional and non-lesional skin of psoriasis patients. There are two objects in the Psoriasis dataset. See ?Psoriasis for more details.

## Warning: replacing previous import 'class::somgrid' by 'kohonen::somgrid'
## when loading 'cogena'
## [1] "DEexprs"     "sampleLabel"

4 Various Analyses

4.1 What kind of analysis can be done?

The gene set used determines the type of analysis.

There are a variety of gene sets in the cogena package, partly collected from MSigDB and CMap. Gene sets are summarized in Table 2.

Table 2. Cogena Gene Sets

Gene Set Description
c2.cp.biocarta.v5.0.symbols.gmt.xz Biocarta gene sets
c2.cp.kegg.v5.0.symbols.gmt.xz KEGG gene sets
c2.cp.reactome.v5.0.symbols.gmt.xz Reactome gene sets
c2.cp.v5.0.symbols.gmt.xz all canonical pathways
c5.bp.v5.0.symbols.gmt.xz GO biological processes
c5.mf.v5.0.symbols.gmt.xz GO molecular functions
CmapDn100.gmt.xz Connectivity map gene sets: top 100 down regulated per drug
CmapUp100.gmt.xz Connectivity map gene sets: top 100 up regulated per drug

User-defined gene-sets must be formatted gmt and/or compressed by xz, (such as c2.cp.kegg.v5.0.symbols.gmt or c2.cp.kegg.v5.0.symbols.gmt.xz). Gene sets should be copied to the extdata directory in the installation directory of cogena, such as ~/R/x86_64-pc-linux-gnu- library/3.2/cogena/extdata in Linux), or kindly send to the author of cogena to share with others.

4.2 Types of analyses

5 Pathway Analysis

Firstly, KEGG Pathway Analysis, will be demonstrated to show the utility of cogena. The other analyses based on cogena are similar to the process of pathway analysis. Here we used the KEGG pathway gene set (c2.cp.kegg.v5.0.symbols.gmt.xz), hierarchical and Pam clustering methods, 10 clusters, 2 CPU cores and “correlation” distance metric to set up the pathway analysis.

5.1 Parameter setting

5.2 Cogena running

There are two steps for cogena analysis, co-expression analysis and then gene set enrichment analysis (here is pathway anlysis).

5.3 Results of pathway analysis

5.3.1 Summary of cogena result

After completing the cogena analysis, the user can use summary to see the summary of the result of cogena. And enrichment caculates the enrichment score of certain clustering methods and certain numbers of cluster.

Cogena does not automatically set the clustering method or the number of clusters. Here we show some principles to guide the user towards optimal selection of method and number of clusters:

  • Different clusters should account for different gene sets.
  • A gene set should enriched only in one cluster but not two or more.
  • The number of genes in a gene set enriched cluster should be the smallest small possible to achieve the highest enrichment score.
## 
## Clustering Methods:
##  hierarchical pam 
## 
## The Number of Clusters:
##  10 
## 
## Metric of Distance Matrix:
##  correlation 
## 
## Agglomeration method for hierarchical clustering (hclust and agnes):
##  complete 
## 
## Gene set:
##  c2.cp.kegg.v5.0.symbols.gmt.xz

5.3.2 Heatmap of expression profiling with clusters

heatmapCluster is developed to show the co-expression of differentially expressed genes. Figure 2 produced by heatmapCluster is an enhanced heatmap with co-expression information. Moreover, it is obvious to know which cluster contains up-regulated or down-regulated genes based on the colour.

## The number of genes in each cluster:
## upDownGene
##   1   2 
## 468 238 
## cluster_size
##   1   2   3   4   5   6   7   8   9  10 
## 158  65  38  92  50  67  63  94  61  18
Heatmap of expression profiling with clusters

Heatmap of expression profiling with clusters

5.3.3 Enrichment heatmap of co-expressed genes

heatmapPEI can be used to show the enrichment graph. See Figure 3. See ?heatmapPEI for more details. Many parameters are configurable, while generally the default will be fine.

KEGG pathway enrichment Figure 3A shows the pathway enrichment for each cluster as well as up-regulated, down-regulated and all the differentially expressed genes. The enrichment scores can be ranked by a certain cluster or the max or average scores of all the scores for each pathway.

KEGG pathway enrichment circle visualization Figure 3B shows the pathway enrichment using circle plotting.

6 Drug repositioning

Pathway analysis demonstrates that specific disease pathways are often represented by a single cluster. Accordingly, we recommend that drug repositioning is performed based on co-expressed gene clusters instead of all the differentially expressed genes. If the input of cogena is disease related data, the drugs enriched should recover the gene expression changed by the disease (the drug should induce an opposite direction in expression to the disease), while if the input is drug related, the enriched drugs should show similar gene expression changes caused by the drug studied. Here we show drugs for treating psoriasis, an autoimmune disease.

6.1 Drug repositioning analysis running

The drug repositioning gene set choice of CmapDn100.gmt.xz or CmapUp100.gmt.xz should be made based on the regulation direction of clusters. For example, as the 7th cluster contains up-regulated genes for psoriasis, the CmapDn100.gmt.xz is chosen for drug repositioning of psoriasis to recover the gene expression changes caused by the disease.

6.2 Original result of drug repositioning

Showing the results ordered by the 7th cluster in Figure 5. The parameter orderMethod is used to order the results.

Drug repositioning

Drug repositioning

6.3 Multi-instance merged result of drug repositioning

Usually there is more than one instance for a drug with different doses or time-points in the Cmap gene set. heatmapCmap can merge the multi-instance results based on parameter mergeMethod (“mean” or “max”). Figure 6 shows the multi-instance merged results ordered by the 7th cluster.

Drug repositioning (multi-instance merged)

Drug repositioning (multi-instance merged)

6.4 Other useful functions

6.4.1 Querying genes in a certain cluster

The user can obtain the genes in a certain cluster via geneInCluster , enabling other analyses, such as drug target identification.

## [1] "CD47"      "SERPINB13" "PNP"       "MPZL2"     "KCNJ15"    "SOX7"

6.4.2 Gene expression profiling with cluster information

It can be obtained by geneExpInCluster. There are two items, clusterGeneExp and label, in the returned object of geneExpInCluster. It can be used for other application.

##         cluster_id GSM337261 GSM337262 GSM337263
## PI3              1  6.556556  6.040479  7.033708
## S100A7A          1  4.989918  4.971686  5.677227
## S100A12          1  4.873823  5.168421  5.255036
## GSM337261 GSM337262 GSM337263 GSM337264 
##        ct        ct        ct        ct 
## Levels: ct Psoriasis

6.4.3 The gene correlation in a cluster

The correlation among a cluster can be checked and visulised by corInCluster. See Figure 4.

Correlation of genes in a cluster

Correlation of genes in a cluster

7 Bug Report

https://github.com/zhilongjia/cogena/issues

8 Citation

Jia, Zhilong, et al. “Cogena, a novel tool for co-expressed gene-set enrichment analysis, applied to drug repositioning and drug mode of action discovery.” BMC Genomics 17.1 (2016): 1.

9 Other Information

System info

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2  cogena_1.16.0   kohonen_3.0.6   ggplot2_3.1.0  
## [5] cluster_2.0.7-1
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.42.0      pkgload_1.0.2       foreach_1.4.4      
##  [4] gtools_3.8.1        assertthat_0.2.0    highr_0.7          
##  [7] stats4_3.5.1        amap_0.8-16         yaml_2.2.0         
## [10] robustbase_0.93-3   remotes_2.0.2       corrplot_0.84      
## [13] sessioninfo_1.1.0   pillar_1.3.0        backports_1.1.2    
## [16] lattice_0.20-35     glue_1.3.0          digest_0.6.18      
## [19] colorspace_1.3-2    htmltools_0.3.6     Matrix_1.2-14      
## [22] plyr_1.8.4          pcaPP_1.9-73        pkgconfig_2.0.2    
## [25] devtools_2.0.1      purrr_0.2.5         mvtnorm_1.0-8      
## [28] scales_1.0.0        processx_3.2.0      gdata_2.18.0       
## [31] tibble_1.4.2        biwt_1.0            usethis_1.4.0      
## [34] withr_2.1.2         fastcluster_1.1.25  BiocGenerics_0.28.0
## [37] lazyeval_0.2.1      cli_1.0.1           mclust_5.4.1       
## [40] magrittr_1.5        crayon_1.3.4        memoise_1.1.0      
## [43] evaluate_0.12       ps_1.2.0            apcluster_1.4.7    
## [46] fs_1.2.6            doParallel_1.0.14   MASS_7.3-51        
## [49] gplots_3.0.1        class_7.3-14        pkgbuild_1.0.2     
## [52] tools_3.5.1         prettyunits_1.0.2   stringr_1.3.1      
## [55] munsell_0.5.0       callr_3.0.0         compiler_3.5.1     
## [58] caTools_1.17.1.1    rlang_0.3.0.1       grid_3.5.1         
## [61] iterators_1.0.10    bitops_1.0-6        base64enc_0.1-3    
## [64] rmarkdown_1.10      testthat_2.0.1      gtable_0.2.0       
## [67] codetools_0.2-15    reshape2_1.4.3      rrcov_1.4-4        
## [70] R6_2.3.0            knitr_1.20          dplyr_0.7.7        
## [73] bindr_0.1.1         rprojroot_1.3-2     KernSmooth_2.23-15 
## [76] desc_1.2.0          stringi_1.2.4       parallel_3.5.1     
## [79] Rcpp_0.12.19        DEoptimR_1.0-8      tidyselect_0.2.5