Contents

1 Introduction

Self-training Gene Clustering Pipeline (SGCP) is a framework for gene co-expression network construction and analysis. The goal in these networks is to group the genes in a way that those with similar expression pattern fall within the same network cluster, commonly called module. SGCP consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology (GO) information in a semi-supervised clustering method that further improves the quality of the computed modules. SGCP results in highly enriched modules. Preprint of manuscript describing SGCP in details is available in here.

1.1 SGCP installation

To install the package SGCP use the following. For more information please visit here). In this manual guide, SGCP also relies on two more packages; SummarizedExperiment and org.Hs.eg.db.

library(BiocManager)
BiocManager::install('SGCP')

Let’s start.

library(SGCP)

1.2 SGCP General Input

SGCP has three main input; gene expression, geneID, and genome wide annotation database.

  • expData is a matrix or a dataframe of size m*n where m and n are the number of genes and samples respectively and it can be either DNA-microarray or RNA-seq . In another words, in expData, rows and columns correspond to genes and samples respectively and the entry i,j is an expression value for gene i in sample j. SGCP does not perform any normalization or correction for batch effects and it is assumed that these preprocessing steps have been already performed.

  • geneID is a vector of size m where entry at index i denotes the gene identifier for gene i. Note that there is one-to-one correspondence between the rows in expData and geneID vector where index i in geneId indicates the gene identifier for row i in expData.

  • annotation_db the name of a genome wide annotation package of the organism of interest for gene ontology (GO) enrichment step. annotation_db must be installed by user prior using SGCP.

Here, are some important annotation_db along with its corresponding identifiers.

organism annotation_db gene identifier
Homo sapiens (Hs) org.Hs.eg.db Entrez Gene identifiers
Drosophila melanogaster (Dm) org.Dm.eg.db Entrez Gene identifiers
Rattus norvegicus (Rn) org.Rn.eg.db Entrez Gene identifiers
Mus musculus (Mm) org.Mm.eg.db Entrez Gene identifiers
Arabidopsis thaliana (At) org.At.tair.db TAIR identifiers

Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.

Note that SGCP depends on GOstats for GO enrichment, thus (geneID) and (annotation_db) must be compatible to this package standard.

1.2.1 SGCP Input Example

For illustrative purposes we will use an example of a gene expression provided with SGCP. This data originally represent 5000 genes in 57 samples for Homo sapiens, for more information see (Cheng et al.) Here, to ease the computation, 1000 genes that have the most variance selected with 5 samples have been selected.

For this input example we need to install the following packages.

The input example is organized as an object of SummarizedExperiment. The assay field shows the expression matrix in which rows and columns correspond to gene and samples respectively. rowData denotes the gene Entrez ids for the gene. Note that there is 1-to-1 correspondence between rows in assays and rowData fields, thus rowData at index i shows the gene Entrez id for gene at row i in assays field. We call this object cheng. annotation_db is org.Hs.eg.db, and is required to be installed if is not, see link. Let’s look at the input example.

library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment 
## dim: 1500 10 
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
##                SYMBOL    ENTREZID
##           <character> <character>
## 7503             XIST        7503
## 100462977    MTRNR2L1   100462977
## 4878             NPPA        4878
## 4879             NPPB        4879
## 6192           RPS4Y1        6192
## 9086           EIF1AY        9086

This data has the dimension of 1500 genes and 10 samples.

Now, we are ready to create the three main inputs. Using assay and rowData functions we can have access to expression matrix and gene Entrez ids respectively.

message("expData")
## expData
expData <- assay(cheng)
head(expData)
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
dim(expData)
## [1] 1500   10
message(" \n geneID")
##  
##  geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503"      "100462977" "4878"      "4879"      "6192"      "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"

1.3 SGCP Pipeline Parameters and Workflow

SGCP is based on 5 main steps to produce the final modules. Each step offers parameters that can be adjusted by the user as follow. Each step is implemented in a single function.

  1. Network Construction step (adjacencyMatrix function)
    • calibration: boolean, default FALSE, if TRUE SGCP performs calibration step for more information see link.
    • norm: boolean, default TRUE, if TRUE SGCP divides each genes (rows) by its norm2.
    • tom: boolean, default TRUE, if TRUE SGCP adds TOM to the network.
    • saveAdja: boolean, default FALSE, if TRUE, the adjacency matrix will be saved .
    • adjaNameFile: string indicates the name of file for saving adjacency matrix.
    • hm: string indicates the name of file for saving adjacency matrix heat map.
  2. Network Clustering step (clustering function)
    • kopt: an integer, default NULL, denotes the optimal number of clusters k by the user.
    • method: either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL, default NULL. Defines the method for number of cluster.
    • func.GO: a function for gene ontology validation, default is sum.
    • func.conduct: a function for conductance validation, default is min.
    • maxIter: an integer, identifies the maximum number of iteration in kmeans.
    • numStart: an integer, identifies the number of start in kmeans.
    • saveOrig: boolean, default TRUE, if TRUE, keeps the transformed matrix.
    • n_egvec: either “all” or an integer, default = 100, indicates the number of columns of transformed matrix to be kept.
    • sil: boolean, default FALSE, if TRUE, calculates silhouette index per gene. at the end of this step, initial clusters are produced.
  3. Gene Ontology Enrichment step (geneOntology function)
    • direction: test direction, default c(“over”, “under”), for over-represented, or under-represented GO terms
    • ontology: GO ontologies, default c(“BP”, “CC”, “MF”), BP: Biological Process, CC: Cellular Component, MF: Molecular Function.
    • hgCutoff: a numeric value in (0,1) as the p-value baseline, default 0.05, GO terms smaller than hgCutoff value are kept.
    • cond: boolean, default TRUE, if TRUE conditional hypergeometric test is performed.
  4. Gene Semi-labeling step (semiLabeling function)
    • cutoff: a numeric in (0, 1) default NULL, is a base line for GO term significancy to identify remarkable and unremarkable genes.
    • percent: a number in (0,1) default 0.1, indicate the percentile for finding top GO terms.
    • stp: a number in (0,1) default 0.01, indicates increasing value to be added to percent parameter.
  5. Semi-supervised Classification step (semiSupevised function)
    • model: either “knn” or “lr” for classification model, knn: k nearest neighbors, lr: logistic regression.
    • kn: an integer default NULL indicating the number of neighbors in knn, if kn is NULL, then kn = 20 : (20 + 2 * k) if 2 * k < 30 otherwise 20 : 30, at the end of this step, final modules are produced.

At the end, SGCP performs on more step of Gene Ontology Enrichment on the final modules.

Detailed of the steps are available in the manuscript in SGCP. In Network Construction, user can identify of any of steps to be added to the network. In Network Clustering, If kopt is not null, SGCP will find clusters based on kopt. Otherwise, if method is not null, SGCP will pick k based on the selected method. Otherwise, if geneID and annotation_db is null, SGCP will determine the optimal method and its corresponding number of cluster based on conductance validation. It picks a method that func.conduct (default min) on its cluster is minimum. Otherwise, SGCP will use gene ontology validation (by default) to find the optimal method and its corresponding number of clusters. To this end, it performs gene ontology enrichment on the cluster with minimum conductance index per method and pick the one that has the maximum func.GO (default sum) over -log10 of p-values. In Gene Semi-labeling step, if cutoff is not NULL, SGCP considers genes associated to GO terms more significant than\(~\)cutoff\(~\) as remarkable. Otherwise, SGCP collects all GO terms from all clusters and picks the percent (default 0.1) mot significant GO terms among them. If Genes associated to these significant terms come from more than a single cluster, SGCP takes these genes as remarkable. Otherwise, it adds\(~\)stp\(~\)to\(~\)percent\(~\)and repeat this process until remarkable genes come from at least two clusters. In Semi-supervised Classification remarkable clusters are the clusters that have at least one remarkable gene.

2 Automatic Run, ezSGCP Function

ezSGCP function implements the SGCP pipeline in one function. Parameters are the same as here except calib corresponds to calibration parameter in Network Construction method_k, f.GO, f.conduct, maxIteration, numberStart parameters correspond to method , func.GO, func.conduct, maxIter, numStart in Network Clustering respectively. dir, onto, hgCut, condTest correspond to direction, ontology, hgCutoff, and cond in Gene Ontology Enrichment respectively. semilabel also is a boolean parameter, that if is FALSE, SGCP will stop after initial clusters. For full parameter description, see the help document.

Below shows how to call this function. Because, this may take up to 15 minutes to be completed, I have already stored the result as sgcp and commented the function. For your practice, uncomment the function and run it. In this call, I set the sil parameter to TRUE to get the gene silhouette index.

# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db, 
#               eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm= NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
##                Length Class      Mode   
## semilabel       1     -none-     logical
## clusterLabels   3     data.frame list   
## clustering     13     -none-     list   
## initial.GO      2     -none-     list   
## semiLabeling    2     -none-     list   
## semiSupervised  3     -none-     list   
## final.GO        2     -none-     list

If you run the function, you will see the following during the execution, which tells you the current state of ezSGCP function.

##  starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
##  it may take time...
## network is created, done!...

##  starting network clustering step...
## calculating normalized Laplacian 
##  it may take time...
## calculating eigenvalues/vectors 
##  it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3

##  Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method secondOrderGap...
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method additiveGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index 
##  it may take time...
## network clustering is done...

## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..

## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
##  it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      4      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list     
## 24-nearest neighbor model
## Training set outcome distribution:

##   1   2 
## 498 667 

## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
##   1   2 
## 130 205 
## semi-supervised done!..

## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## ezSGCP done!..

As it is seen, SGCP print out information of each step. In this example, there are three potential number of clusters, 2, 3, and 5, and based on gene ontology validation, 2 is selected as the optimal number of cluster.

We found out SGCP performs well with default parameters in most of cases. However, users have option to change the setting according to their need. For instance, there are cases that calib increases the final module enrichment. Similarly, adding TOM is not always helpful. By default SGCP assumes that user does not know the exact number of cluster, nor does it know the method for it. Therefore, it uses gene ontology validation to identify the initial clusters. We highly recommend users to perform SGCP on the three potential metod “relativeGap”, “secondOrderGap”, “additiveGap”.

SGCP allows user to visualize the result. SGCP_ezPLOT takes sgcp result from ezSGCP function along with expData and geneID. It returns two files; excel and pdf depending on the initial call. User also can keep the plotting object by setting keep = TRUE. Here, we set silhouette_in to TRUE to see the silhouette index plot.

message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)