This vignette illustrates calculating, visualising, and interpreting the irrelevance-threshold matrix of dissimilarities. This matrix is derived from the statistical method implemented in goSorensen to identify significant equivalence between two or more gene lists based on the Sorensen dissimilarity and the joint frequencies of GO terms enrichment.
Initially, we provide a brief introduction to the meaning of the matrix. Subsequently, we elucidate the process of computing this matrix and its significance. Subsequently, we represent the dissimilarity matrix using interpretable statistical graphs, such as the dendrogram and the Biplot generated from an MDS multidimensional scaling analysis. In particular, we suggest a method for identifying the GO terms that explain the formation of the axes (two more relevant dimensions) of the MDS-Biplot. This characterisation is valuable as it enables us to ascertain the biological functions associated with the equivalences detected between compared lists.
goSorensen 1.7.1
library(goSorensen)
The majority of functions of goSorensen are devoted to implementing a statistical hypothesis test to detect equivalence between two or more gene lists. This method was introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.
Following this method, we develop the irrelevance-threshold matrix of dissimilarities \(\mathfrak{D}\):
\[\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \\ \mathfrak{D} = \begin{matrix} L_2 \\ L_3 \\ L_4 \\ \vdots \\ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}_{21} & & & & \\ \mathfrak{d}_{31} & \mathfrak{d}_{32} & & & \\ \mathfrak{d}_{41} & \mathfrak{d}_{42} & \mathfrak{d}_{43} & & \\ \vdots & \vdots & \vdots & & \\ \mathfrak{d}_{s1} & \mathfrak{d}_{s2} & \mathfrak{d}_{s3} & \ldots & \mathfrak{d}_{s(s-1)} \\ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}\]
The goal is to determine a dissimilarity measure \(\mathfrak{d}_{ij}\) for comparing two lists (\(L_i, L_j\)). This dissimilarity is not only a descriptive measure, but it is based on the irrelevance threshold determining whether two lists (\(L_i, L_j\)) are equivalent, which ensures that the dissimilarity \(\mathfrak{d}_{ij}\) is directly associated with the statistically significant equivalence declaration for a specific ontology (BP, MF, or CC) and GO level.
The dissimilarity measures between all the compared \(s\) feature lists can be collectively obtained in the symmetric matrix \(\mathfrak{D}\) using the following algorithm:
Step 1. Establish \(h = s(s-1)\) equivalence hypothesis tests. Each test \(ET_I\) (with \(I = 1, \ldots, h\)) compares some specific pair of lists \(L_{i}, L_{j}\) (with \(i, j = 1, 2, \ldots, s\)).
Step 2. Let \(\mathfrak{d}_h\) be the smallest irrelevance limit that makes all the null hypotheses corresponding to the \(h\) equivalence tests be rejected (All the \(s\) lists are equivalent): \[\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_v(\mathfrak{d})_{(I)} \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).\]
Step 3. Obtain \(\mathfrak{d}_h\) and use it as the irrelevance limit between the lists \(L_i, L_j\) corresponding to the last position in a vector of ordered p-values \(P_v(\mathfrak{d}_{h})_{(I)}\), i.e. \(\mathfrak{d}_{ij} = \mathfrak{d}_{h}.\)
Step 4. Set \(h = h-1\), excluding the comparison between the lists \(L_i, L_j\) from Step \(3\), and go back to Step \(2\) until \(h=0\).
We use a specific database to provide a clear and precise explanation of the computation, visualization, and interpretation of the \(\mathfrak{D}\) matrix. Specifically, we will employ the set of PBTs
lists, which is available as a resource in the goSorensen package. Access to this database can be obtained as follows:
data("pbtGeneLists")
PBTs
is an object of class list, containing 14 character vectors with the ENTREZ gene identifiers of a gene list related to cell graft rejection:
sapply(pbtGeneLists, length)
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD1 IRITD3
108 105 948 118 131 205 181 321
IRITD5 KT1 LT1 LT3 Rej_RATs TCMR_RATs
215 562 913 693 107 121
One method to compute the dissimilarity matrix \(\mathfrak{D}\) for a given ontology and level (e.g., ontology BP, level 5) is to directly input the gene lists into the sorenThreshold
function:
# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
Wed Jul 31 17:39:37 2024 Building all enrichment contingency tables...
Wed Jul 31 17:43:04 2024 Computing all threshold equivalence distances...
round(dismatBP5, 2)
ABMR_RATs BAT CT1 ENDAT GRIT2 GRIT3 IRITD1 IRITD3 IRITD5 KT1 LT1
BAT 0.98
CT1 1.00 1.00
ENDAT 0.89 1.00 1.00
GRIT2 0.71 0.97 1.00 0.97
GRIT3 0.47 0.98 1.00 0.97 0.35
IRITD1 1.00 1.00 1.00 0.99 1.00 1.00
IRITD3 0.92 1.00 1.00 0.79 0.96 0.96 1.00
IRITD5 0.87 1.00 1.00 0.88 0.90 0.87 1.00 0.90
KT1 1.00 1.00 0.81 0.95 1.00 1.00 1.00 1.00 1.00
LT1 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
LT3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.29
Rej_RATs 0.32 0.98 1.00 0.95 0.64 0.38 1.00 0.94 0.85 1.00 1.00
TCMR_RATs 0.52 0.96 1.00 0.98 0.58 0.42 1.00 0.93 0.86 1.00 1.00
LT3 Rej_RATs
BAT
CT1
ENDAT
GRIT2
GRIT3
IRITD1
IRITD3
IRITD5
KT1
LT1
LT3
Rej_RATs 1.00
TCMR_RATs 1.00 0.39
Each value in this matrix represents dissimilarities determined by the equivalence threshold which determines whether the compared lists are equivalent. In addition, it is essential to consider that when the Sorensen dissimilarity value is irrelevantly equal to zero (equivalent to zero), it suggests biological similarity. Based on this, we can deduce that the smallest values (near zero) in the matrix \(\mathfrak{D}\) indicate equivalence between lists, whereas the largest values (close to 1) indicate that the lists cannot be considered equivalent.
Another way is to compute all the pairwise \(2 \times 2\) enrichment contingency tables previously. From this table, we can obtain the dissimilarity matrix as follows:
# Previously compute the 2x2 contingency tables:
ctableBP5 <- buildEnrichTable(pbtGeneLists, onto = "BP", GOLevel = 5,
geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db")
# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(ctableBP5)
The result is exactly the same matrix od dissimilarities above showed.
Similarly, we can directly perform the computation using the gene lists dataset:
allDismat <- allSorenThreshold (pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
Or, previously computing all the pairwise contingency tables for the ontologies (BP, CC, MF) and GO levels defined by the user:
allTabs <- allBuildEnrichTable(pbtGeneLists, geneUniverse = humanEntrezIDs,
orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"),
GOLevels = 4:7)
allDismat <- allSorenThreshold(allTabs)
Given the substantial quantity of outcomes obtained, exhibiting all of these results is not feasible. The result is a collection of dissimilarity matrices for the different ontologies and GO levels defined in the ontos
and GOLevels
arguments.
Suppose the ontos
and GOLevels
arguments are not provided. In that case, the allSorenThreshold
function automatically computes the dissimilarity matrices for all three ontologies (BP, CC and MF) and GO levels ranging from 3 to 10.
Below, we graph a dendrogram based on the irrelevance-threshold matrix of dissimilarities for the particular case of the MF ontology, GO level 5, in our PBTs dataset. The dissimilarity matrix for this particular case was computed above and it is saved in the object dismatBP5
.
clust.threshold <- hclustThreshold(dismatBP5)
plot(clust.threshold)
The dendrogram displays the lists of genes that are most similar and most distant. For instance, when considering a dissimilarity of 0.6, chosen arbitrarily, we can observe the formation of three groups. Within each group, the lists can be regarded as similar but different from those in other groups. There are three groups of lists that can be considered equivalent. The first group includes LT3 and LT1. The second group contains GRIT2 and GRIT3. The third group comprises the lists TCMR_RATs, ABMR_RATs, and Rej_RATs.
We initiate the process by performing a multidimensional scaling MDS analysis. MDS transforms the initial dissimilarity matrix into Euclidean distances while maintaining the maximum possible variation of the original dissimilarities. In this instance, we compute the first two dimensions, which account for the most relevant proportion of the original variability.
mds <- as.data.frame(cmdscale(dismatBP5, k = 2))
mds
V1 V2
ABMR_RATs 0.38854334 0.04392676
BAT -0.16327259 -0.09079187
CT1 -0.26862731 -0.24698285
ENDAT -0.16612448 -0.28420320
GRIT2 0.35204607 0.07836793
GRIT3 0.45673252 0.10300851
IRITD1 -0.22078619 -0.12783291
IRITD3 -0.13587210 -0.24607399
IRITD5 0.01069298 -0.13659312
KT1 -0.27602226 -0.27344008
LT1 -0.41625933 0.50991290
LT3 -0.41625933 0.50991290
Rej_RATs 0.43787856 0.08027021
TCMR_RATs 0.41733012 0.08051881
Below we plot the biplot:
library(ggplot2)
library(ggrepel)
graph <- ggplot() +
geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
geom_text_repel(aes(mds[,1], mds[,2], label = attr(dismatBP5, "Labels")),
color = "black", size = 3) +
xlab("Dim 1") +
ylab("Dim 2") +
theme_light()
graph
The MDS-Biplot suggests that the formation of groups is consistent with what was observed in the dendrogram. It is important to remember that the distances observed between lists in the biplot are defined based on the dissimilarity threshold that establishes the equivalence of the compared lists. For instance, the proximity between the Rej_RATs and GRIT3 lists implies that both are equivalent.
However, it is more interesting to identify the GO terms most strongly associated with the dimensions generated in the biplot rather than just determining if the lists are equivalent. These GO terms elucidate the biological functions associated with the observed distances between lists, explaining the formation of groups and the biological reasons for detecting equivalences between lists. In the next section, we develop a 5-step procedure for identifying these GO terms:
The following 5-step procedure helps identify the GO terms that biologically elucidate the creation of groups in each dimension of the Biplot. In this specific instance, we will execute the technique on Dimension 1. However, the approach remains identical if we were to explain Dimension 2.
# Split the axis 20% to the left, 60% to the middle and 20% to the right:
prop <- c(0.2, 0.6, 0.2)
# Sort according dimension 1:
sorted <- mds[order(mds[, 1]), ]
# Determine the range for dimension 1.
range <- sorted[, 1][c(1, nrow(mds))]
# Find the cutpoints to split the axis:
cutpoints <- cumsum(prop) * diff(range) + range[1]
cutpoints <- cutpoints[-length(cutpoints)]
# Identify lists to the left:
lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ])
# Identify lists to the right
lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ])
lleft
[1] "LT1" "LT3" "KT1" "CT1"
lright
[1] "GRIT2" "ABMR_RATs" "TCMR_RATs" "Rej_RATs" "GRIT3"
graph +
geom_vline(xintercept = cutpoints, color = "red", linetype = "dashed",
linewidth = 0.75)