Andreas Mock
Cancer Research UK Cambridge Institute, University of Cambridge
The following tutorial describes the generation of a weighted co-expression network from TCGA (The Cancer Genome Atlas) RNAseq data using the WGCNA
R package by Langfelder and Horvarth1. In addition, individual genes and modules will be related to sample traits. Exemplarly, a co-expression network for skin cutaneous melanomas (SKCM) will be generated. However, the following weighted gene co-expression analysis (WGCNA) framework is applicable to any TCGA tumour entity.
The code of this vignette is a proof of principial example that can’t be run as listed without assembling the RNAseq data as described in the following beforehand.
Melanoma RNAseq data for the CVE extension were downloaded as expression estimates per gene (RNAseq2 level 3 data) from the TCGA data portal. Please note that the TCGA Data portal is no longer operational and all TCGA data now resides at the Genomic Data Commons. For WGCNA, the individual TCGA RNAseq2 level 3 files were concatenated to a matrix RNAseq
with gene symbols as row and TCGA patient barcodes as column names.
Further preprocessing included the removal of control samples (for more information see the TCGA Wiki) and expression estimates with counts in less than 20% of cases.
RNAseq = RNAseq[apply(RNAseq,1,function(x) sum(x==0))<ncol(RNAseq)*0.8,]
To relate co-expression modules to disease phenotypes, clinical metadata is needed. As for the melanoma TCGA data, the clinical
data was published as a curated spreadsheet in the supplements of the latest publication (suppl_table_S1D.txt)2.
As read counts follow a negative binomial distribution, which has a mathematical theory less tractable than that of the normal distribution, RNAseq data was normalised with the voom
methodology3. The voom
method estimates the mean-variance of the log-counts and generates a precision weight for each observation. This way, a comparative analysis can be performed with all bioinformatic workflows originally developed for microarray analyses.
library(limma)
RNAseq_voom = voom(RNAseq)$E
A large fraction of genes are not differentially expressed between samples. These have to be excluded from WGCNA, as two genes without notable variance in expression between patients will be highly correlated. As a heuristic cutoff, the top 5000 most variant genes have been used in most WGCNA studies. In detail the median absolute devision (MAD) was used as a robust measure of variability.
#transpose matrix to correlate genes in the following
WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing = T)[1:5000],])
The connections within a network can be fully described by its adjacency matrix \(a_{ij}\), a \(N~x~N\) matrix whose component \(a_{ij}\) denotes the connection strength between node \(i\) and \(j\). The connection strength is defined by the co-expression similarity \(s_{ij}\). The most widely used method defines \(s_{ij}\) as the absolute value of the correlation coefficient between the profiles of node \(i\) and \(j\): \(s_{ij} = |cor(x_i,x_j)|\). However, we employed the biweight midcorrelation to define \(s_{ij}\), as it is more robust to outliers4. This feature is pivotal, as we do not expect genes to be co-expressed in all patients.
#similarity measure between gene profiles: biweight midcorrelation
library(WGCNA)
s = abs(bicor(WGCNA_matrix))
Originally, the co-expression similarity matrix was transformed into the adjacency matrix using a ‘hard’ threshold. In these unweighted co-expression networks, two genes were identified to be linked (\(a_{ij} = 1\)), if the absolute correlation between their expression profiles were higher than a ‘hard’ threshold \(\tau\). However, this hard threshold does not reflect the underlying continuous co-expression measure and leads to a significant loss of information. As a consequence, Horvath and colleagues introduced a new framework for weighted gene co-expression analysis (WGCNA)5. At its core, a weighted adjacency is defined by raising the co-expression similarity to a power (‘soft’ threshold):
\[a_{ij} = s_{ij}^\beta\]
with \(\beta \geq 1\). To choose an appropriate \(\beta\)-value, the authors present a methodology that assesses the scale free topology of the network. For detailed rational of this approach, please see Zhang and Horvath6.
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(WGCNA_matrix, powerVector = powers, verbose = 5)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab='Soft Threshold (power)',ylab='Scale Free Topology Model Fit,signed R^2',
type='n', main = paste('Scale independence'));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=1,col='red'); abline(h=0.90,col='red')
As for the melanoma network, a beta value of 3 was the lowest power for which the scale-free topology fit index curve flattens out upon reaching a high value (\(R^2\) 0.9 as suggested by Langfelder and Horvarth).
#calculation of adjacency matrix
beta = 3
a = s^beta
Lastly, the dissimilarity measure is defined by
\[w_{ij} = 1 - a_{ij}\]
#dissimilarity measure
w = 1-a
Please note that TOM-based (topological overlap matrix) dissimilarity proposed by Horvarth and colleagues did not result in distinct gene modules for the analysed melanoma network.
To identify co-expression modules, genes are next clustered based on the dissimilarity measure, where branches of the dendrogram correspond to modules. The gene dendrogram obtained by average linkage hierarchical clustering is depicted in figure 2. Ultimately, gene co-expression modules are detected by applying a branch cutting method. We employed the dynamic branch cut method developed by Langfelder and colleagues 7, as constant height cutoffs exhibit suboptimal performance on complicated dendrograms. WGCNA of the 472 TCGA melanoma samples revealed 41 co-expression modules. All genes that are not significantly co-expressed within a module are summarized in an additional module 0 for further analysis.
#create gene tree by average linkage hierarchical clustering
geneTree = hclust(as.dist(w), method = 'average')
#module identification using dynamic tree cut algorithm
modules = cutreeDynamic(dendro = geneTree, distM = w, deepSplit = 4, pamRespectsDendro = FALSE,
minClusterSize = 30)
#assign module colours
module.colours = labels2colors(modules)
#plot the dendrogram and corresponding colour bars underneath
plotDendroAndColors(geneTree, module.colours, 'Module colours', dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05, main='')
The relation between the identified co-expression modules can be visualized by a dendrogram of their eigengenes (fig. 3). The module eigengene is defined as the first principal component of its expression matrix. It could be shown that the module= eigengene is highly correlated with the gene that has the highest intramodular connectivity8.
library(ape)
#calculate eigengenes
MEs = moduleEigengenes(WGCNA_matrix, colors = module.colours, excludeGrey = FALSE)$eigengenes
#calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
#cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = 'average');
#plot the result with phytools package
par(mar=c(2,2,2,2))
plot.phylo(as.phylo(METree),type = 'fan',show.tip.label = FALSE, main='')
tiplabels(frame = 'circle',col='black', text=rep('',length(unique(modules))), bg = levels(as.factor(module.colours)))
An advantage of co-expression network analysis is the possibility to integrate external information. At the lowest hierarchical level, gene significance (GS) measures can be defined as the statistical significance (i.e. p-value, \(p_i\)) between the \(i\)-th node profile (gene) \(x_i\) and the sample trait \(T\)
\[GS_i = -log~p_i\]
Module significance in turn can be determined as the average absolute gene significance measure. This conceptual framework can be adapted to any research question. The clinical metadata used in the following was obtained from the recent TCGA melanoma publication9 (Supplemental Table S1D: Patient Centric Table).
#load clinical metadata. Make sure that patient barcodes are in the same format
#create second expression matrix for which the detailed clinical data is available
WGCNA_matrix2 = WGCNA_matrix[match(clinical$Name, rownames(WGCNA_matrix)),]
#CAVE: 1 sample of detailed clinical metadata is not in downloaded data (TCGA-GN-A269-01')
not.available = which(is.na(rownames(WGCNA_matrix2))==TRUE)
WGCNA_matrix2 = WGCNA_matrix2[-not.available,]
str(WGCNA_matrix2)
#hence it needs to be removed from clinical table for further analysis
clinical = clinical[-not.available,]
Representatively, co-expression modules will be related to the so called lymphocyte score, which summarises the lymphocyte distribution and density in the pathological review.
#grouping in high and low lymphocyte score (lscore)
lscore = as.numeric(clinical$LYMPHOCYTE.SCORE)
lscore[lscore<3] = 0
lscore[lscore>0] = 1
#calculate gene significance measure for lymphocyte score (lscore) - Welch's t-Test
GS_lscore = t(sapply(1:ncol(WGCNA_matrix2),function(x)c(t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$p.value,
t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[1],
t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[2])))
GS_lscore = cbind(GS.lscore, abs(GS_lscore[,2] - GS_lscore[,3]))
colnames(GS_lscore) = c('p_value','mean_high_lscore','mean_low_lscore',
'effect_size(high-low score)'); rownames(GS_lscore) = colnames(WGCNA_matrix2)
To enable a high-level interpretation of the dendrogram of module eigengenes, gene ontology (GO) enrichment analysis was performed for the module genes using the GOstats
R package 10. Modules were named according to the most significant GO einrichment given a cutoff for the ontology size. The smaller the ontology size, the more specific the term. In this analysis a cutoff of 100 terms per ontology was chosen.
#reference genes = all 5000 top mad genes
ref_genes = colnames(WGCNA_matrix2)
#create data frame for GO analysis
library(org.Hs.eg.db)
GO = toTable(org.Hs.egGO); SYMBOL = toTable(org.Hs.egSYMBOL)
GO_data_frame = data.frame(GO$go_id, GO$Evidence,SYMBOL$symbol[match(GO$gene_id,SYMBOL$gene_id)])
#create GOAllFrame object
library(AnnotationDbi)
GO_ALLFrame = GOAllFrame(GOFrame(GO_data_frame, organism = 'Homo sapiens'))
#create gene set
library(GSEABase)
gsc <- GeneSetCollection(GO_ALLFrame, setType = GOCollection())
#perform GO enrichment analysis and save results to list - this make take several minutes
library(GEOstats)
GSEAGO = vector('list',length(unique(modules)))
for(i in 0:(length(unique(modules))-1)){
GSEAGO[[i+1]] = summary(hyperGTest(GSEAGOHyperGParams(name = 'Homo sapiens GO',
geneSetCollection = gsc, geneIds = colnames(RNAseq)[modules==i],
universeGeneIds = ref.genes, ontology = 'BP', pvalueCutoff = 0.05,
conditional = FALSE, testDirection = 'over')))
print(i)
}
cutoff_size = 100
GO_module_name = rep(NA,length(unique(modules)))
for (i in 1:length(unique(modules))){
GO.module.name[i] =
GSEAGO[[i]][GSEAGO[[i]]$Size<cutoff_size,
][which(GSEAGO[[i]][GSEAGO[[i]]$Size<cutoff_size,]$Count==max(GSEAGO[[i]][GSEAGO[[i]]$
Size<cutoff.size,]$Count)),7]
}
GO.module.name[1] = 'module 0'
#calculate module significance
MS.lscore = as.data.frame(cbind(GS.lscore,modules))
MS.lscore$log_p_value = -log10(as.numeric(MS.lscore$p_value))
MS.lscore = ddply(MS.lscore, .(modules), summarize, mean(log_p_value), sd(log_p_value))
colnames(MS.lscore) = c('modules','pval','sd')
MS.lscore.bar = as.numeric(MS.lscore[,2])
MS.lscore.bar[MS.lscore.bar<(-log10(0.05))] = 0
names(MS.lscore.bar) = GO.module.name
METree.GO = METree
label.order = match(METree$labels,paste0('ME',labels2colors(0:(length(unique(modules))-1))))
METree.GO$labels = GO.module.name[label.order]
plotTree.wBars(as.phylo(METree.GO), MS.lscore.bar, tip.labels = TRUE, scale = 0.2)
Assessing the module significance for different sample traits facilitates an understanding of individual co-expression modules for melanoma biology. As for the prioritisation of variants we are next interested in the role of the variant gene within a co-expression module. To this end, Langfelder and Horvath suggest a ‘fuzzy’ measure of module membership defined as
\[K^q = |cor(x_i,E^q)|\]
where \(x_i\) is the profile of gene \(i\) and \(E^q\) is the eigengene of module \(q\). Based on this definition, \(K\) describes how closely related gene \(i\) is to module \(q\). A meaningful visualization is consequently plotting the module membership over the p-value of the respective GS measure. As a third dimension, the dot-size is weighted according to the effect size.
#Calculate module membership
MM = abs(bicor(RNAseq, MEs))
#plot individual module of interest (MOI)
MOI = 3 #T cell differentiation co-expression module
plot(-log10(GS.lscore[modules==MOI,1]), MM[modules==MOI,MOI], pch=20,
cex=(GS.lscore[modules==MOI,4]/max(GS.lscore[,4],na.rm=TRUE))*4,
xlab='p-value (-log10) lymphocyte score', ylab='membership to module 3')
abline(v=-log10(0.05), lty=2, lwd=2)
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.2.0
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 assertthat_0.1 formatR_1.4 htmltools_0.3.5
## [5] tools_3.3.1 yaml_2.1.13 tibble_1.2 Rcpp_0.12.7
## [9] stringi_1.1.2 rmarkdown_1.1 knitr_1.14 stringr_1.1.0
## [13] digest_0.6.10 evaluate_0.10
Peter Langfelder and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. In: BMC Bioinformatics 9 (Jan. 2008), pp. 559–559.↩
Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In: Cell 161.7 (June 2015), pp. 1681–1696.↩
Charity W Law et al. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. In: Genome biology 15.2 (Jan. 2014), R29–R29.↩
Chun-Hou Zheng et al. Gene differential coexpression analysis based on biweight correlation and maximum clique. In: BMC bioinformatics 15 Suppl 15 (2014), S3.↩
Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. In: Statistical applications in genetics and molecular biology 4 (2005), Article17.↩
Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. In: Statistical applications in genetics and molecular biology 4 (2005), Article17.↩
Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression Network Analysis. In: PLoS Computational Biology (PLOSCB) 4(8) 4.8 (2008), e1000117–e1000117.↩
Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression Network Analysis. In: PLoS Computational Biology (PLOSCB) 4(8) 4.8 (2008), e1000117–e1000117.↩
Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In: Cell 161.7 (June 2015), pp. 1681–1696.↩
S Falcon and R Gentleman. Using GOstats to test gene lists for GO term association. In: Bioinformatics 23.2 (Jan. 2007), pp. 257–258.↩