Self-training Gene Clustering Pipeline (SGCP
) is a framework for gene co-expression network construction and analysis. The goal in these networks is to group the genes in a way that those with similar expression pattern fall within the same network cluster, commonly called module. SGCP
consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology (GO) information in a semi-supervised clustering method that further improves the quality of the computed modules. SGCP
results in highly enriched modules. Preprint of manuscript describing SGCP
in details is available
in here.
SGCP
installationTo install the package SGCP
use the following. For more information please visit here). In this manual guide, SGCP
also relies on two more packages; SummarizedExperiment
and org.Hs.eg.db
.
library(BiocManager)
BiocManager::install('SGCP')
Let’s start.
library(SGCP)
SGCP
General InputSGCP
has three main input; gene expression, geneID, and genome wide annotation
database.
expData is a matrix or a dataframe of size m*n
where m
and n
are the number of genes and samples respectively and it can be either DNA-microarray or RNA-seq . In another words, in expData
, rows and columns correspond to genes and samples respectively and the entry i,j
is an expression value for gene i
in sample j
. SGCP
does not perform any normalization or correction for batch effects and it is assumed that these
preprocessing steps have been already performed.
geneID is a vector of size m
where entry at index i
denotes the
gene identifier for gene i
. Note that there is one-to-one correspondence
between the rows in expData
and geneID
vector where index i
in geneId
indicates the gene identifier for row i
in expData
.
annotation_db the name of a genome wide annotation package of the organism
of interest for gene ontology (GO) enrichment step. annotation_db
must be
installed by user prior using SGCP
.
Here, are some important annotation_db
along with its corresponding identifiers.
organism | annotation_db | gene identifier |
---|---|---|
Homo sapiens (Hs) | org.Hs.eg.db | Entrez Gene identifiers |
Drosophila melanogaster (Dm) | org.Dm.eg.db | Entrez Gene identifiers |
Rattus norvegicus (Rn) | org.Rn.eg.db | Entrez Gene identifiers |
Mus musculus (Mm) | org.Mm.eg.db | Entrez Gene identifiers |
Arabidopsis thaliana (At) | org.At.tair.db | TAIR identifiers |
Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.
Note that SGCP
depends on GOstats
for GO enrichment, thus (geneID) and (annotation_db) must be compatible to
this package standard.
SGCP
Input ExampleFor illustrative purposes we will use an example of a gene expression provided with SGCP
.
This data originally represent 5000 genes in 57 samples for Homo sapiens,
for more information see (Cheng et al.) Here, to ease the computation, 1000 genes that have the most variance selected with 5 samples have been selected.
For this input example we need to install the following packages.
The input example is organized as an object of SummarizedExperiment
. The assay
field shows the expression matrix in which rows and columns correspond to gene and samples respectively. rowData
denotes the gene Entrez ids for the gene. Note that there is 1-to-1 correspondence between rows in assays
and rowData
fields, thus rowData
at index i
shows the gene Entrez id for gene at row i
in assays
field. We call this object cheng
.
annotation_db
is org.Hs.eg.db, and is required to be installed if is not, see
link. Let’s look at the input example.
library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment
## dim: 1500 10
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
## SYMBOL ENTREZID
## <character> <character>
## 7503 XIST 7503
## 100462977 MTRNR2L1 100462977
## 4878 NPPA 4878
## 4879 NPPB 4879
## 6192 RPS4Y1 6192
## 9086 EIF1AY 9086
This data has the dimension of 1500 genes and 10 samples.
Now, we are ready to create the three main inputs. Using assay
and rowData
functions we can have access to expression matrix and gene Entrez ids respectively.
message("expData")
## expData
expData <- assay(cheng)
head(expData)
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
dim(expData)
## [1] 1500 10
message(" \n geneID")
##
## geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503" "100462977" "4878" "4879" "6192" "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"
SGCP
Pipeline Parameters and WorkflowSGCP
is based on 5
main steps to produce the final modules.
Each step offers parameters that can be adjusted by the user as follow. Each step
is implemented in a single function.
adjacencyMatrix
function)
calibration
: boolean, default FALSE, if TRUE SGCP
performs calibration step
for more information see link.norm
: boolean, default TRUE, if TRUE SGCP
divides each genes (rows) by its norm2.tom
: boolean, default TRUE, if TRUE SGCP
adds TOM to the network.saveAdja
: boolean, default FALSE, if TRUE, the adjacency matrix will be saved .adjaNameFile
: string indicates the name of file for saving adjacency matrix.hm
: string indicates the name of file for saving adjacency matrix heat map.clustering
function)
kopt
: an integer, default NULL, denotes the optimal number of clusters k by the user.method
: either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL,
default NULL. Defines the method for number of cluster.func.GO
: a function for gene ontology validation, default is sum.func.conduct
: a function for conductance validation, default is min.maxIter
: an integer, identifies the maximum number of iteration in kmeans.numStart
: an integer, identifies the number of start in kmeans.saveOrig
: boolean, default TRUE, if TRUE, keeps the transformed matrix.n_egvec
: either “all” or an integer, default = 100, indicates the
number of columns of transformed matrix to be kept.sil
: boolean, default FALSE, if TRUE, calculates silhouette index per gene.
at the end of this step, initial clusters are produced.geneOntology
function)
direction
: test direction, default c(“over”, “under”), for over-represented, or under-represented GO termsontology
: GO ontologies, default c(“BP”, “CC”, “MF”), BP: Biological Process, CC: Cellular Component, MF: Molecular Function.hgCutoff
: a numeric value in (0,1) as the p-value baseline, default 0.05, GO terms smaller than hgCutoff
value are kept.cond
: boolean, default TRUE, if TRUE conditional hypergeometric test is performed.semiLabeling
function)
cutoff
: a numeric in (0, 1) default NULL, is a base line for GO term significancy to identify remarkable and unremarkable genes.percent
: a number in (0,1) default 0.1, indicate the percentile for finding top GO terms.stp
: a number in (0,1) default 0.01, indicates increasing value to be added to percent
parameter.semiSupevised
function)
model
: either “knn” or “lr” for classification model, knn: k nearest neighbors, lr: logistic regression.kn
: an integer default NULL indicating the number of neighbors in knn,
if kn
is NULL, then kn
= 20 : (20 + 2 * k) if 2 * k < 30 otherwise 20 : 30,
at the end of this step, final modules are produced.At the end, SGCP
performs on more step of Gene Ontology Enrichment on the final modules.
Detailed of the steps are available in the manuscript in SGCP.
In Network Construction, user can identify of any of steps to be
added to the network. In Network Clustering, If kopt
is not null,
SGCP
will find clusters based on kopt
. Otherwise, if method
is not null,
SGCP
will pick k based on the selected method. Otherwise, if geneID
and
annotation_db
is null, SGCP
will determine the optimal method and its
corresponding number of cluster based on conductance validation. It picks a
method that func.conduct
(default min) on its cluster is minimum. Otherwise,
SGCP
will use gene ontology validation (by default) to find the optimal method and its
corresponding number of clusters. To this end, it performs gene ontology enrichment
on the cluster with minimum conductance index per method and pick the one that
has the maximum func.GO
(default sum) over -log10 of p-values. In
Gene Semi-labeling step, if cutoff
is not NULL, SGCP
considers genes
associated to GO terms more significant than\(~\)cutoff
\(~\) as remarkable. Otherwise,
SGCP
collects all GO terms from all clusters and picks the percent
(default 0.1)
mot significant GO terms among them. If Genes associated to these significant terms
come from more than a single cluster, SGCP
takes these genes as remarkable.
Otherwise, it adds\(~\)stp
\(~\)to\(~\)percent
\(~\)and repeat this process until remarkable genes
come from at least two clusters. In Semi-supervised Classification
remarkable clusters are the clusters that have at least one remarkable gene.
ezSGCP
FunctionezSGCP
function implements the SGCP
pipeline in one function. Parameters are the same as
here except calib
corresponds to calibration
parameter in
Network Construction method_k
, f.GO
, f.conduct
,
maxIteration
, numberStart
parameters correspond to method
, func.GO
,
func.conduct
, maxIter
, numStart
in Network Clustering respectively.
dir
, onto
, hgCut
, condTest
correspond to direction
, ontology
,
hgCutoff
, and cond
in Gene Ontology Enrichment respectively. semilabel
also is a boolean parameter, that if is FALSE, SGCP
will stop after initial clusters. For full parameter description, see the help document.
Below shows how to call this function. Because, this may take up to 15 minutes to be completed, I have already stored the result as sgcp
and commented the function. For your practice, uncomment the function and run it. In this call, I set the sil
parameter to TRUE to get the gene silhouette index.
# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm= NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
## Length Class Mode
## semilabel 1 -none- logical
## clusterLabels 3 data.frame list
## clustering 13 -none- list
## initial.GO 2 -none- list
## semiLabeling 2 -none- list
## semiSupervised 3 -none- list
## final.GO 2 -none- list
If you run the function, you will see the following during the execution, which tells you the current state of ezSGCP
function.
## starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
## starting network clustering step...
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 24-nearest neighbor model
## Training set outcome distribution:
## 1 2
## 498 667
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 130 205
## semi-supervised done!..
## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## ezSGCP done!..
As it is seen, SGCP
print out information of each step.
In this example, there are three potential number of clusters, 2, 3, and 5, and based on gene ontology validation, 2 is selected as the optimal number of cluster.
We found out
SGCP
performs well with default parameters in most of cases. However, users have option to change the setting according to their need. For instance, there are cases thatcalib
increases the final module enrichment. Similarly, adding TOM is not always helpful. By defaultSGCP
assumes that user does not know the exact number of cluster, nor does it know the method for it. Therefore, it uses gene ontology validation to identify the initial clusters. We highly recommend users to performSGCP
on the three potential metod “relativeGap”, “secondOrderGap”, “additiveGap”.
SGCP
allows user to visualize the result. SGCP_ezPLOT
takes sgcp
result from
ezSGCP
function along with expData
and geneID
. It returns two files; excel
and pdf depending on the initial call. User also can keep the plotting object
by setting keep = TRUE. Here, we set silhouette_in
to TRUE to see the silhouette index plot.
message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)