library(goSorensen)

Introduction.

The majority of functions of goSorensen are devoted to implementing a statistical hypothesis test to detect equivalence between two or more gene lists. This method was introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.

Following this method, we develop the irrelevance-threshold matrix of dissimilarities \(\mathfrak{D}\):

\[\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \\ \mathfrak{D} = \begin{matrix} L_2 \\ L_3 \\ L_4 \\ \vdots \\ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}_{21} & & & & \\ \mathfrak{d}_{31} & \mathfrak{d}_{32} & & & \\ \mathfrak{d}_{41} & \mathfrak{d}_{42} & \mathfrak{d}_{43} & & \\ \vdots & \vdots & \vdots & & \\ \mathfrak{d}_{s1} & \mathfrak{d}_{s2} & \mathfrak{d}_{s3} & \ldots & \mathfrak{d}_{s(s-1)} \\ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}\]

The goal is to determine a dissimilarity measure \(\mathfrak{d}_{ij}\) for comparing two lists (\(L_i, L_j\)). This dissimilarity is not only a descriptive measure, but it is based on the irrelevance threshold determining whether two lists (\(L_i, L_j\)) are equivalent, which ensures that the dissimilarity \(\mathfrak{d}_{ij}\) is directly associated with the statistically significant equivalence declaration for a specific ontology (BP, MF, or CC) and GO level.

The dissimilarity measures between all the compared \(s\) feature lists can be collectively obtained in the symmetric matrix \(\mathfrak{D}\) using the following algorithm:

  • Step 1. Establish \(h = s(s-1)\) equivalence hypothesis tests. Each test \(ET_I\) (with \(I = 1, \ldots, h\)) compares some specific pair of lists \(L_{i}, L_{j}\) (with \(i, j = 1, 2, \ldots, s\)).

  • Step 2. Let \(\mathfrak{d}_h\) be the smallest irrelevance limit that makes all the null hypotheses corresponding to the \(h\) equivalence tests be rejected (All the \(s\) lists are equivalent): \[\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_v(\mathfrak{d})_{(I)} \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).\]

  • Step 3. Obtain \(\mathfrak{d}_h\) and use it as the irrelevance limit between the lists \(L_i, L_j\) corresponding to the last position in a vector of ordered p-values \(P_v(\mathfrak{d}_{h})_{(I)}\), i.e. \(\mathfrak{d}_{ij} = \mathfrak{d}_{h}.\)

  • Step 4. Set \(h = h-1\), excluding the comparison between the lists \(L_i, L_j\) from Step \(3\), and go back to Step \(2\) until \(h=0\).

Data used in this vignette.

We use a specific database to provide a clear and precise explanation of the computation, visualization, and interpretation of the \(\mathfrak{D}\) matrix. Specifically, we will employ the set of PBTs lists, which is available as a resource in the goSorensen package. Access to this database can be obtained as follows:

data("pbtGeneLists")

PBTs is an object of class list, containing 14 character vectors with the ENTREZ gene identifiers of a gene list related to cell graft rejection:

sapply(pbtGeneLists, length)
ABMR_RATs       BAT       CT1     ENDAT     GRIT2     GRIT3    IRITD1    IRITD3 
      108       105       948       118       131       205       181       321 
   IRITD5       KT1       LT1       LT3  Rej_RATs TCMR_RATs 
      215       562       913       693       107       121 

1 Performing the Irrelevance - Threshold Matrix of Dissimilarities.

1.1 For an Specific Ontology and GO level.

One method to compute the dissimilarity matrix \(\mathfrak{D}\) for a given ontology and level (e.g., ontology BP, level 5) is to directly input the gene lists into the sorenThreshold function:

# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")

# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(pbtGeneLists, onto = "BP", GOLevel = 5,
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db")

 Tue Oct 29 20:59:43 2024   Building all enrichment contingency tables...

 Tue Oct 29 21:03:19 2024   Computing all threshold equivalence distances...
round(dismatBP5, 2)
          ABMR_RATs  BAT  CT1 ENDAT GRIT2 GRIT3 IRITD3 IRITD5  KT1  LT1  LT3
BAT            0.98                                                         
CT1            1.00 1.00                                                    
ENDAT          0.90 1.00 1.00                                               
GRIT2          0.69 0.98 1.00  0.98                                         
GRIT3          0.45 0.98 1.00  0.96  0.37                                   
IRITD3         0.91 1.00 1.00  0.76  0.94  0.94                             
IRITD5         0.87 1.00 1.00  0.88  0.89  0.88   0.91                      
KT1            1.00 1.00 0.81  0.95  1.00  1.00   1.00   1.00               
LT1            1.00 1.00 1.00  1.00  1.00  1.00   1.00   1.00 1.00          
LT3            1.00 1.00 1.00  1.00  1.00  1.00   1.00   1.00 1.00 0.30     
Rej_RATs       0.34 0.99 1.00  0.95  0.56  0.40   0.93   0.87 1.00 1.00 1.00
TCMR_RATs      0.50 0.96 1.00  0.97  0.62  0.44   0.91   0.86 1.00 1.00 1.00
          Rej_RATs
BAT               
CT1               
ENDAT             
GRIT2             
GRIT3             
IRITD3            
IRITD5            
KT1               
LT1               
LT3               
Rej_RATs          
TCMR_RATs     0.42

Each value in this matrix represents dissimilarities determined by the equivalence threshold which determines whether the compared lists are equivalent. In addition, it is essential to consider that when the Sorensen dissimilarity value is irrelevantly equal to zero (equivalent to zero), it suggests biological similarity. Based on this, we can deduce that the smallest values (near zero) in the matrix \(\mathfrak{D}\) indicate equivalence between lists, whereas the largest values (close to 1) indicate that the lists cannot be considered equivalent.

Another way is to compute all the pairwise \(2 \times 2\) enrichment contingency tables previously. From this table, we can obtain the dissimilarity matrix as follows:

# Previously compute the 2x2 contingency tables:
ctableBP5 <- buildEnrichTable(pbtGeneLists, onto = "BP", GOLevel = 5,
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db")

# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(ctableBP5)

The result is exactly the same matrix od dissimilarities above showed.

1.2 For More than One Ontology and GO Level Defined by the User.

Similarly, we can directly perform the computation using the gene lists dataset:

allDismat <- allSorenThreshold (pbtGeneLists, geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"), 
                            GOLevels = 4:7)

Or, previously computing all the pairwise contingency tables for the ontologies (BP, CC, MF) and GO levels defined by the user:

allTabs <- allBuildEnrichTable(pbtGeneLists, geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"), 
                            GOLevels = 4:7)
allDismat <- allSorenThreshold(allTabs)

Given the substantial quantity of outcomes obtained, exhibiting all of these results is not feasible. The result is a collection of dissimilarity matrices for the different ontologies and GO levels defined in the ontos and GOLevels arguments.

Suppose the ontos and GOLevels arguments are not provided. In that case, the allSorenThreshold function automatically computes the dissimilarity matrices for all three ontologies (BP, CC and MF) and GO levels ranging from 3 to 10.

2 Representing the Dissimilarity Matrix in Statistic Graphs.

2.1 Dendrogram.

Below, we graph a dendrogram based on the irrelevance-threshold matrix of dissimilarities for the particular case of the MF ontology, GO level 5, in our PBTs dataset. The dissimilarity matrix for this particular case was computed above and it is saved in the object dismatBP5.

clust.threshold <- hclustThreshold(dismatBP5)
plot(clust.threshold)

The dendrogram displays the lists of genes that are most similar and most distant. For instance, when considering a dissimilarity of 0.6, chosen arbitrarily, we can observe the formation of three groups. Within each group, the lists can be regarded as similar but different from those in other groups. There are three groups of lists that can be considered equivalent. The first group includes LT3 and LT1. The second group contains GRIT2 and GRIT3. The third group comprises the lists TCMR_RATs, ABMR_RATs, and Rej_RATs.

2.2 MDS-Biplot.

We initiate the process by performing a multidimensional scaling MDS analysis. MDS transforms the initial dissimilarity matrix into Euclidean distances while maintaining the maximum possible variation of the original dissimilarities. In this instance, we compute the first two dimensions, which account for the most relevant proportion of the original variability.

mds <- as.data.frame(cmdscale(dismatBP5, k = 2))
mds
                   V1          V2
ABMR_RATs  0.37749638 -0.05316328
BAT       -0.18271776  0.09899928
CT1       -0.28485420  0.26753828
ENDAT     -0.17334937  0.31109043
GRIT2      0.34624136 -0.08875324
GRIT3      0.43263735 -0.10879125
IRITD3    -0.12025074  0.26261937
IRITD5    -0.01466796  0.15346417
KT1       -0.29190006  0.29352922
LT1       -0.44970672 -0.48333325
LT3       -0.44970672 -0.48333325
Rej_RATs   0.42374314 -0.09174890
TCMR_RATs  0.38703531 -0.07811760

Below we plot the biplot:

library(ggplot2)
library(ggrepel)
graph <- ggplot() +
  geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
  geom_text_repel(aes(mds[,1], mds[,2], label = attr(dismatBP5, "Labels")), 
                  color = "black", size = 3) +
    xlab("Dim 1") +
    ylab("Dim 2") +
    theme_light()
graph