This vignette illustrates calculating, visualising, and interpreting the irrelevance-threshold matrix of dissimilarities. This matrix is derived from the statistical method implemented in goSorensen to identify significant equivalence between two or more gene lists based on the Sorensen dissimilarity and the joint frequencies of GO terms enrichment.
Initially, we provide a brief introduction to the meaning of the matrix. Subsequently, we elucidate the process of computing this matrix and its significance. Subsequently, we represent the dissimilarity matrix using interpretable statistical graphs, such as the dendrogram and the Biplot generated from an MDS multidimensional scaling analysis. In particular, we suggest a method for identifying the GO terms that explain the formation of the axes (two more relevant dimensions) of the MDS-Biplot. This characterisation is valuable as it enables us to ascertain the biological functions associated with the equivalences detected between compared lists.
goSorensen 1.8.0
library(goSorensen)
The majority of functions of goSorensen are devoted to implementing a statistical hypothesis test to detect equivalence between two or more gene lists. This method was introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.
Following this method, we develop the irrelevance-threshold matrix of dissimilarities \(\mathfrak{D}\):
\[\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \\ \mathfrak{D} = \begin{matrix} L_2 \\ L_3 \\ L_4 \\ \vdots \\ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}_{21} & & & & \\ \mathfrak{d}_{31} & \mathfrak{d}_{32} & & & \\ \mathfrak{d}_{41} & \mathfrak{d}_{42} & \mathfrak{d}_{43} & & \\ \vdots & \vdots & \vdots & & \\ \mathfrak{d}_{s1} & \mathfrak{d}_{s2} & \mathfrak{d}_{s3} & \ldots & \mathfrak{d}_{s(s-1)} \\ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}\]
The goal is to determine a dissimilarity measure \(\mathfrak{d}_{ij}\) for comparing two lists (\(L_i, L_j\)). This dissimilarity is not only a descriptive measure, but it is based on the irrelevance threshold determining whether two lists (\(L_i, L_j\)) are equivalent, which ensures that the dissimilarity \(\mathfrak{d}_{ij}\) is directly associated with the statistically significant equivalence declaration for a specific ontology (BP, MF, or CC) and GO level.
The dissimilarity measures between all the compared \(s\) feature lists can be collectively obtained in the symmetric matrix \(\mathfrak{D}\) using the following algorithm:
Step 1. Establish \(h = s(s-1)\) equivalence hypothesis tests. Each test \(ET_I\) (with \(I = 1, \ldots, h\)) compares some specific pair of lists \(L_{i}, L_{j}\) (with \(i, j = 1, 2, \ldots, s\)).
Step 2. Let \(\mathfrak{d}_h\) be the smallest irrelevance limit that makes all the null hypotheses corresponding to the \(h\) equivalence tests be rejected (All the \(s\) lists are equivalent): \[\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_v(\mathfrak{d})_{(I)} \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).\]
Step 3. Obtain \(\mathfrak{d}_h\) and use it as the irrelevance limit between the lists \(L_i, L_j\) corresponding to the last position in a vector of ordered p-values \(P_v(\mathfrak{d}_{h})_{(I)}\), i.e. \(\mathfrak{d}_{ij} = \mathfrak{d}_{h}.\)
Step 4. Set \(h = h-1\), excluding the comparison between the lists \(L_i, L_j\) from Step \(3\), and go back to Step \(2\) until \(h=0\).
We use a specific database to provide a clear and precise explanation of the computation, visualization, and interpretation of the \(\mathfrak{D}\) matrix. Specifically, we will employ the set of PBTs
lists, which is available as a resource in the goSorensen package. Access to this database can be obtained as follows:
data("pbtGeneLists")
PBTs
is an object of class list, containing 14 character vectors with the ENTREZ gene identifiers of a gene list related to cell graft rejection:
sapply(pbtGeneLists, length)
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD1 IRITD3
108 105 948 118 131 205 181 321
IRITD5 KT1 LT1 LT3 Rej_RATs TCMR_RATs
215 562 913 693 107 121
One method to compute the dissimilarity matrix \(\mathfrak{D}\) for a given ontology and level (e.g., ontology BP, level 5) is to directly input the gene lists into the sorenThreshold
function:
# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
Tue Oct 29 20:59:43 2024 Building all enrichment contingency tables...
Tue Oct 29 21:03:19 2024 Computing all threshold equivalence distances...
round(dismatBP5, 2)
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD3 IRITD5 KT1 LT1 LT3
BAT 0.98
CT1 1.00 1.00
ENDAT 0.90 1.00 1.00
GRIT2 0.69 0.98 1.00 0.98
GRIT3 0.45 0.98 1.00 0.96 0.37
IRITD3 0.91 1.00 1.00 0.76 0.94 0.94
IRITD5 0.87 1.00 1.00 0.88 0.89 0.88 0.91
KT1 1.00 1.00 0.81 0.95 1.00 1.00 1.00 1.00
LT1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
LT3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.30
Rej_RATs 0.34 0.99 1.00 0.95 0.56 0.40 0.93 0.87 1.00 1.00 1.00
TCMR_RATs 0.50 0.96 1.00 0.97 0.62 0.44 0.91 0.86 1.00 1.00 1.00
Rej_RATs
BAT
CT1
ENDAT
GRIT2
GRIT3
IRITD3
IRITD5
KT1
LT1
LT3
Rej_RATs
TCMR_RATs 0.42
Each value in this matrix represents dissimilarities determined by the equivalence threshold which determines whether the compared lists are equivalent. In addition, it is essential to consider that when the Sorensen dissimilarity value is irrelevantly equal to zero (equivalent to zero), it suggests biological similarity. Based on this, we can deduce that the smallest values (near zero) in the matrix \(\mathfrak{D}\) indicate equivalence between lists, whereas the largest values (close to 1) indicate that the lists cannot be considered equivalent.
Another way is to compute all the pairwise \(2 \times 2\) enrichment contingency tables previously. From this table, we can obtain the dissimilarity matrix as follows:
# Previously compute the 2x2 contingency tables:
ctableBP5 <- buildEnrichTable(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(ctableBP5)
The result is exactly the same matrix od dissimilarities above showed.
Similarly, we can directly perform the computation using the gene lists dataset:
allDismat <- allSorenThreshold (pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
Or, previously computing all the pairwise contingency tables for the ontologies (BP, CC, MF) and GO levels defined by the user:
allTabs <- allBuildEnrichTable(pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
allDismat <- allSorenThreshold(allTabs)
Given the substantial quantity of outcomes obtained, exhibiting all of these results is not feasible. The result is a collection of dissimilarity matrices for the different ontologies and GO levels defined in the ontos
and GOLevels
arguments.
Suppose the ontos
and GOLevels
arguments are not provided. In that case, the allSorenThreshold
function automatically computes the dissimilarity matrices for all three ontologies (BP, CC and MF) and GO levels ranging from 3 to 10.
Below, we graph a dendrogram based on the irrelevance-threshold matrix of dissimilarities for the particular case of the MF ontology, GO level 5, in our PBTs dataset. The dissimilarity matrix for this particular case was computed above and it is saved in the object dismatBP5
.
clust.threshold <- hclustThreshold(dismatBP5)
plot(clust.threshold)
The dendrogram displays the lists of genes that are most similar and most distant. For instance, when considering a dissimilarity of 0.6, chosen arbitrarily, we can observe the formation of three groups. Within each group, the lists can be regarded as similar but different from those in other groups. There are three groups of lists that can be considered equivalent. The first group includes LT3 and LT1. The second group contains GRIT2 and GRIT3. The third group comprises the lists TCMR_RATs, ABMR_RATs, and Rej_RATs.
We initiate the process by performing a multidimensional scaling MDS analysis. MDS transforms the initial dissimilarity matrix into Euclidean distances while maintaining the maximum possible variation of the original dissimilarities. In this instance, we compute the first two dimensions, which account for the most relevant proportion of the original variability.
mds <- as.data.frame(cmdscale(dismatBP5, k = 2))
mds
V1 V2
ABMR_RATs 0.37749638 -0.05316328
BAT -0.18271776 0.09899928
CT1 -0.28485420 0.26753828
ENDAT -0.17334937 0.31109043
GRIT2 0.34624136 -0.08875324
GRIT3 0.43263735 -0.10879125
IRITD3 -0.12025074 0.26261937
IRITD5 -0.01466796 0.15346417
KT1 -0.29190006 0.29352922
LT1 -0.44970672 -0.48333325
LT3 -0.44970672 -0.48333325
Rej_RATs 0.42374314 -0.09174890
TCMR_RATs 0.38703531 -0.07811760
Below we plot the biplot:
library(ggplot2)
library(ggrepel)
graph <- ggplot() +
geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
geom_text_repel(aes(mds[,1], mds[,2], label = attr(dismatBP5, "Labels")),
color = "black", size = 3) +
xlab("Dim 1") +
ylab("Dim 2") +
theme_light()
graph