The Self-training Gene Clustering Pipeline (SGCP
) is a robust framework designed for constructing and analyzing gene co-expression networks. Its primary objective is to organize genes into clusters, or modules, based on their similar expression patterns within these networks. SGCP
introduces several innovative steps that facilitate the identification of highly enriched modules through an unsupervised approach. What sets SGCP
apart from existing frameworks is its incorporation of a unique semi-supervised clustering method that integrates Gene Ontology (GO) information. This additional step significantly enhances the quality of the identified modules. SGCP
ultimately produces modules that are notably enriched with relevant biological insights. For a comprehensive overview of SGCP
, including detailed methodology and results, refer to the preprint of the manuscript available here.
SGCP
installationTo install the package SGCP
package and its dependencies, follow these steps. For more information please visit here). In this manual guide, SGCP
also relies on two more packages; SummarizedExperiment
and org.Hs.eg.db
.
library(BiocManager)
BiocManager::install('SGCP')
Let’s start.
library(SGCP)
SGCP
General InputSGCP
requires three main inputs: gene expression data, gene identifiers, and a genome-wide annotation database.
expData: A matrix or dataframe of size m*n
, where m
is the number of genes and n
is the number of samples. It can be either DNA-microarray or RNA-seq data. In expData
, rows represent genes, columns represent samples, and the entry at position (i, j)
is the expression value for gene i
in sample j.
SGCP
does not perform normalization or batch effect correction; these preprocessing steps should be completed beforehand.
geneID: A vector of size m, where each entry at index i
denotes the gene identifier for gene i
. There is a one-to-one correspondence between the rows in expData
and the entries in the geneID vector, meaning index i
in geneID corresponds to row i in expData.
annotation_db: The name of a genome-wide annotation package for the organism of interest, used for the gene ontology (GO) enrichment step. The annotation_db
must be installed by the user before using SGCP.
Here, are some important annotation_db
along with its corresponding identifiers.
organism | annotation_db | gene identifier |
---|---|---|
Homo sapiens (Hs) | org.Hs.eg.db | Entrez Gene identifiers |
Drosophila melanogaster (Dm) | org.Dm.eg.db | Entrez Gene identifiers |
Rattus norvegicus (Rn) | org.Rn.eg.db | Entrez Gene identifiers |
Mus musculus (Mm) | org.Mm.eg.db | Entrez Gene identifiers |
Arabidopsis thaliana (At) | org.At.tair.db | TAIR identifiers |
Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.
Note that SGCP
depends on GOstats
for GO enrichment, thus (geneID) and (annotation_db) must be compatible to
this package standard.
SGCP
Input ExampleFor illustrative purposes we will use an example of a gene expression provided with SGCP
.
This dataset originally represent 5000 genes in 57 samples for Homo sapiens,
for more information see (Cheng et al.) Here, to ease the computation, 1000 genes with the highest variance and 5 samples have been selected from the original dataset.
For this input example we need to install the following packages.
The input example is organized as an object of SummarizedExperiment
. The assay
field contains the expression matrix, where rows correspond to genes and columns correspond to samples. The rowData
field denotes the gene Entrez IDs. There is a one-to-one correspondence between rows in the assay and rowData
fields; thus, rowData
at index i
shows the gene Entrez ID for the gene at row i
in the assay field. We call this object cheng.
The annotation_db
used is org.Hs.eg.db
, which must be installed if it is not already. For installation, see
link. Let’s look at the input example.
library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment
## dim: 1500 10
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
## SYMBOL ENTREZID
## <character> <character>
## 7503 XIST 7503
## 100462977 MTRNR2L1 100462977
## 4878 NPPA 4878
## 4879 NPPB 4879
## 6192 RPS4Y1 6192
## 9086 EIF1AY 9086
This data has the dimension of 1500 genes and 10 samples.
Now, we are ready to create the three main inputs. Using assay
and rowData
functions we can have access to expression matrix and gene Entrez IDs respectively.
message("expData")
## expData
expData <- assay(cheng)
head(expData)
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
dim(expData)
## [1] 1500 10
message(" \n geneID")
##
## geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503" "100462977" "4878" "4879" "6192" "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"
SGCP
Pipeline Parameters and WorkflowSGCP
operates through five main steps to produce the final modules. Each step provides parameters that can be adjusted by the user. Each step is implemented as a single function.
adjacencyMatrix
function)
calibration
: Boolean, default FALSE
. If TRUE
SGCP
performs calibration step
for more information see link.norm
: Boolean, default TRUE
. If TRUE
SGCP
divides each genes (rows) by its norm2.tom
: Boolean, default TRUE
. If TRUE
SGCP
adds TOM to the network.saveAdja
: Boolean, default FALSE
. If TRUE
, the adjacency matrix will be saved .adjaNameFile
: string indicating the name of file for saving adjacency matrix.hm
: string indicates the name of file for saving adjacency matrix heat map.clustering
function)
kopt
: An integer, default is NULL
. Denotes the optimal number of clusters k
by the user.method
: Either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL
(defualt),
Defines the method for number of cluster.func.GO
: A function for gene ontology validation, default is sum.
func.conduct
: A function for conductance validation, default is min.
maxIter
: An integer identifying the maximum number of iteration in kmeans.
numStart
: An integer identifying the number of starts in kmeans.
saveOrig
: Boolean, default TRUE
. If TRUE
, keeps the transformed matrix.n_egvec
: Either “all” or an integer, default is 100
. Indicates the
number of columns of transformed matrix to be kept.sil
: Boolean, default FALSE
. If TRUE
, calculates silhouette index per gene.
at the end of this step, initial clusters are produced.geneOntology
function)
direction
: Test direction, default is c(“over”, “under”), for over-represented, or under-represented GO termsontology
: GO ontologies, default c(“BP”, “CC”, “MF”), where BP stands for Biological Process, CC for Cellular Component, and MF for Molecular Function.hgCutoff
: A numeric value in 0
and 1
, default is 0.05
. GO terms with p-values smaller than hgCutoff
are retained.cond
: Boolean, default is TRUE
. If TRUE
, performs conditional hypergeometric test.semiLabeling
function)
cutoff
: a numeric in 0 and 1, default is NULL
. This baseline is used to determined the significance of GO terms, distinguishing between remarkable and unremarkable genes.percent
: a number in 0 and 1, default is 0.1
. Indicates the percentile used to identify top GO terms.stp
: a number in 0 and ,1 default is 0.01
. Specifies the increment added to the percent
parameter.semiSupevised
function)
model
: Either “knn” for k-nearest neighbors or “lr” for logistic regression, specifying the classification model.kn
: An integer, default is NULL
, indicating the number of neighbors in k-nearest neighbors (knn
).
If kn
is NULL
, then kn
is determined as follow kn = 20
if 2 * k < 30 otherwise kn = 20 : 30
,At the end of this step, final modules are produced. Subsequently, SGCP
performs an additional step of Gene Ontology Enrichment on those final modules.
Detailed of the steps are available in the manuscript in SGCP.
In Network Construction, user can identify of any of steps to be
added to the network. In Network Clustering, SGCP
behaves as follows;
If kopt
is not NULL
, SGCP
finds clusters based on kopt
. If kopt
is NULL
and method
is not NULL
,
SGCP
picks k
based on the selected method. If geneID
and
annotation_db
are NULL
, SGCP
determines the optimal method and its
corresponding number of cluster based on conductance validation. It selects the
method that func.conduct
(default min
) on its cluster is minimum. Otherwise,
SGCP
will use gene ontology validation (default) to find the optimal method and its
corresponding number of clusters. It performs gene ontology enrichment
on the cluster with minimum conductance index per method and selects the method that
has the maximum func.GO
(default sum
) over -log10 of p-values. In
Gene Semi-labeling step, if cutoff
is not NULL
, SGCP
considers genes
associated to GO terms more significant than\(~\)cutoff
\(~\) as remarkable. Otherwise,
SGCP
collects all GO terms from all clusters and selects the top percent
(default 0.1
)
most significant GO terms among them. If genes associated to these terms
come from more than a single cluster, SGCP
these genes as remarkable.
Otherwise, it increases\(~\)percent
\(~\)to\(~\)stp
\(~\)and repeat this process until remarkable genes
come from at least two clusters. In Semi-supervised Classification,
remarkable clusters are the clusters that have at least one remarkable gene.
ezSGCP
FunctionThe ezSGCP
function implements the SGCP
pipeline in one function. Parameters align with those specified in the
here with specific mappings for clarity.
*__Network Construction__: Parameter `calib` corresponds to `calibration`.
*__Network Clustering__: Parameters `method_k`, `f.GO`, `f.conduct`, `maxIteration`, and `numberStart` correspond to `method`, `func.GO`, `func.conduct`, `maxIter`, and `numStart`, respectively.
*__Gene Ontology Enrichment__: Parameters `dir`, `onto`, `hgCut`, and `condTest` correspond to `direction`, `ontology`, `hgCutoff`, and `cond`, respectively.
* __Gene Ontology Enrichment__: The boolean parameter `semilabel` controls whether `SGCP` continues to the Semi-supervised Classification step (`TRUE`) or stops after initial clusters (`FALSE`).
For a comprehensive description of each parameter, refer to the help documentation.
Below is an example of how to call this function. Due to its computation time (up to 15
minutes), the result has been precomputed and stored as sgcp
, with the function call commented out. To run it yourself, uncomment the function call. In this example, the sil
parameter is set to TRUE to compute the gene silhouette index.
# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm = NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
## Length Class Mode
## semilabel 1 -none- logical
## clusterLabels 3 data.frame list
## clustering 13 -none- list
## initial.GO 2 -none- list
## semiLabeling 2 -none- list
## semiSupervised 3 -none- list
## final.GO 2 -none- list
If you run the ezSGCP
function, it may provide status updates during execution. Here’s an example of what you might see:
## starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
## starting network clustering step...
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 24-nearest neighbor model
## Training set outcome distribution:
## 1 2
## 498 667
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 130 205
## semi-supervised done!..
## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## ezSGCP done!..
As observed, SGCP
provides detailed information at each step of its execution. In this example, the analysis considers three possible numbers of clusters: 2
, 3
, and 5
. Following gene ontology validation, the optimal number of clusters is determined to be 2
.
While SGCP generally performs well with default parameters, users have the option to customize settings according to their specific needs. For example, enabling calib can enhance the enrichment of the final modules in certain cases. Similarly, the addition of TOM may not always be beneficial.
By default, SGCP operates under the assumption that users do not have prior knowledge of the exact number of clusters or the optimal method for clustering. Therefore, it utilizes gene ontology validation to establish initial clusters. We highly recommend that users evaluate SGCP using all three potential methods: “relativeGap”, “secondOrderGap”, and “additiveGap”.
SGCP
offers visualization capabilities to users. The SGCP_ezPLOT
function accepts the sgcp
result from ezSGCP
, along with expData
and geneID
inputs. It generates two files, an Excel spreadsheet and a PDF, depending on the initial call. Users can also retain the plotting object by setting keep = TRUE
. In this example, we enable silhouette_in to visualize the silhouette index plot.
message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)