library(goSorensen)

Introduction.

The majority of functions of goSorensen are devoted to implementing a statistical hypothesis test to detect equivalence between two or more gene lists. This method was introduced in Flores, P., Salicrú, M., Sánchez-Pla, A. and Ocaña, J.(2022) “An equivalence test between features lists, based on the Sorensen - Dice index and the joint frequencies of GO node enrichment”, BMC Bioinformatics, 2022 23:207.

Following this method, we develop the irrelevance-threshold matrix of dissimilarities \(\mathfrak{D}\):

\[\begin{equation} \begin{aligned} & \begin{matrix} \hspace{0.6cm} L_1 & L_2 & \hspace{0.1cm} L_3 & \hspace{0.1cm} \ldots \hspace{0.1cm} & L_{s-1} \end{matrix} \\ \mathfrak{D} = \begin{matrix} L_2 \\ L_3 \\ L_4 \\ \vdots \\ L_s \end{matrix} & \begin{pmatrix} \mathfrak{d}_{21} & & & & \\ \mathfrak{d}_{31} & \mathfrak{d}_{32} & & & \\ \mathfrak{d}_{41} & \mathfrak{d}_{42} & \mathfrak{d}_{43} & & \\ \vdots & \vdots & \vdots & & \\ \mathfrak{d}_{s1} & \mathfrak{d}_{s2} & \mathfrak{d}_{s3} & \ldots & \mathfrak{d}_{s(s-1)} \\ \end{pmatrix} \end{aligned} \label{it-diss} \end{equation}\]

The goal is to determine a dissimilarity measure \(\mathfrak{d}_{ij}\) for comparing two lists (\(L_i, L_j\)). This dissimilarity is not only a descriptive measure, but it is based on the irrelevance threshold determining whether two lists (\(L_i, L_j\)) are equivalent, which ensures that the dissimilarity \(\mathfrak{d}_{ij}\) is directly associated with the statistically significant equivalence declaration for a specific ontology (BP, MF, or CC) and GO level.

The dissimilarity measures between all the compared \(s\) feature lists can be collectively obtained in the symmetric matrix \(\mathfrak{D}\) using the following algorithm:

  • Step 1. Establish \(h = s(s-1)\) equivalence hypothesis tests. Each test \(ET_I\) (with \(I = 1, \ldots, h\)) compares some specific pair of lists \(L_{i}, L_{j}\) (with \(i, j = 1, 2, \ldots, s\)).

  • Step 2. Let \(\mathfrak{d}_h\) be the smallest irrelevance limit that makes all the null hypotheses corresponding to the \(h\) equivalence tests be rejected (All the \(s\) lists are equivalent): \[\mathfrak{d}_h = \text{min} \left(\mathfrak{d}: P_v(\mathfrak{d})_{(I)} \leq \frac{\alpha}{h+1-I}; \hspace{0.5cm} I=1, 2, \ldots,h \right).\]

  • Step 3. Obtain \(\mathfrak{d}_h\) and use it as the irrelevance limit between the lists \(L_i, L_j\) corresponding to the last position in a vector of ordered p-values \(P_v(\mathfrak{d}_{h})_{(I)}\), i.e. \(\mathfrak{d}_{ij} = \mathfrak{d}_{h}.\)

  • Step 4. Set \(h = h-1\), excluding the comparison between the lists \(L_i, L_j\) from Step \(3\), and go back to Step \(2\) until \(h=0\).

Data used in this vignette.

We use a specific database to provide a clear and precise explanation of the computation, visualization, and interpretation of the \(\mathfrak{D}\) matrix. Specifically, we will employ the set of PBTs lists, which is available as a resource in the goSorensen package. Access to this database can be obtained as follows:

data("pbtGeneLists")

PBTs is an object of class list, containing 14 character vectors with the ENTREZ gene identifiers of a gene list related to cell graft rejection:

sapply(pbtGeneLists, length)
ABMR_RATs       BAT       CT1     ENDAT     GRIT2     GRIT3    IRITD1    IRITD3 
      108       105       948       118       131       205       181       321 
   IRITD5       KT1       LT1       LT3  Rej_RATs TCMR_RATs 
      215       562       913       693       107       121 

1 Performing the Irrelevance - Threshold Matrix of Dissimilarities.

1.1 For an Specific Ontology and GO level.

One method to compute the dissimilarity matrix \(\mathfrak{D}\) for a given ontology and level (e.g., ontology BP, level 5) is to directly input the gene lists into the sorenThreshold function:

# Previously load the genomic annotation package for the studied specie:
library(org.Hs.eg.db)
humanEntrezIDs <- keys(org.Hs.eg.db, keytype = "ENTREZID")

# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(pbtGeneLists, onto = "BP", GOLevel = 5,
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db")

 Wed Jul 31 17:39:37 2024   Building all enrichment contingency tables...

 Wed Jul 31 17:43:04 2024   Computing all threshold equivalence distances...
round(dismatBP5, 2)
          ABMR_RATs  BAT  CT1 ENDAT GRIT2 GRIT3 IRITD1 IRITD3 IRITD5  KT1  LT1
BAT            0.98                                                           
CT1            1.00 1.00                                                      
ENDAT          0.89 1.00 1.00                                                 
GRIT2          0.71 0.97 1.00  0.97                                           
GRIT3          0.47 0.98 1.00  0.97  0.35                                     
IRITD1         1.00 1.00 1.00  0.99  1.00  1.00                               
IRITD3         0.92 1.00 1.00  0.79  0.96  0.96   1.00                        
IRITD5         0.87 1.00 1.00  0.88  0.90  0.87   1.00   0.90                 
KT1            1.00 1.00 0.81  0.95  1.00  1.00   1.00   1.00   1.00          
LT1            1.00 1.00 1.00  1.00  1.00  1.00   1.00   1.00   1.00 1.00     
LT3            1.00 1.00 1.00  1.00  1.00  1.00   1.00   1.00   1.00 1.00 0.29
Rej_RATs       0.32 0.98 1.00  0.95  0.64  0.38   1.00   0.94   0.85 1.00 1.00
TCMR_RATs      0.52 0.96 1.00  0.98  0.58  0.42   1.00   0.93   0.86 1.00 1.00
           LT3 Rej_RATs
BAT                    
CT1                    
ENDAT                  
GRIT2                  
GRIT3                  
IRITD1                 
IRITD3                 
IRITD5                 
KT1                    
LT1                    
LT3                    
Rej_RATs  1.00         
TCMR_RATs 1.00     0.39

Each value in this matrix represents dissimilarities determined by the equivalence threshold which determines whether the compared lists are equivalent. In addition, it is essential to consider that when the Sorensen dissimilarity value is irrelevantly equal to zero (equivalent to zero), it suggests biological similarity. Based on this, we can deduce that the smallest values (near zero) in the matrix \(\mathfrak{D}\) indicate equivalence between lists, whereas the largest values (close to 1) indicate that the lists cannot be considered equivalent.

Another way is to compute all the pairwise \(2 \times 2\) enrichment contingency tables previously. From this table, we can obtain the dissimilarity matrix as follows:

# Previously compute the 2x2 contingency tables:
ctableBP5 <- buildEnrichTable(pbtGeneLists, onto = "BP", GOLevel = 5,
                            geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db")

# Computing the irrelevance-threshold matrix of dissimilarities
dismatBP5 <- sorenThreshold(ctableBP5)

The result is exactly the same matrix od dissimilarities above showed.

1.2 For More than One Ontology and GO Level Defined by the User.

Similarly, we can directly perform the computation using the gene lists dataset:

allDismat <- allSorenThreshold (pbtGeneLists, geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"), 
                            GOLevels = 4:7)

Or, previously computing all the pairwise contingency tables for the ontologies (BP, CC, MF) and GO levels defined by the user:

allTabs <- allBuildEnrichTable(pbtGeneLists, geneUniverse = humanEntrezIDs, 
                            orgPackg = "org.Hs.eg.db", ontos = c("BP", "CC"), 
                            GOLevels = 4:7)
allDismat <- allSorenThreshold(allTabs)

Given the substantial quantity of outcomes obtained, exhibiting all of these results is not feasible. The result is a collection of dissimilarity matrices for the different ontologies and GO levels defined in the ontos and GOLevels arguments.

Suppose the ontos and GOLevels arguments are not provided. In that case, the allSorenThreshold function automatically computes the dissimilarity matrices for all three ontologies (BP, CC and MF) and GO levels ranging from 3 to 10.

2 Representing the Dissimilarity Matrix in Statistic Graphs.

2.1 Dendrogram.

Below, we graph a dendrogram based on the irrelevance-threshold matrix of dissimilarities for the particular case of the MF ontology, GO level 5, in our PBTs dataset. The dissimilarity matrix for this particular case was computed above and it is saved in the object dismatBP5.

clust.threshold <- hclustThreshold(dismatBP5)
plot(clust.threshold)

The dendrogram displays the lists of genes that are most similar and most distant. For instance, when considering a dissimilarity of 0.6, chosen arbitrarily, we can observe the formation of three groups. Within each group, the lists can be regarded as similar but different from those in other groups. There are three groups of lists that can be considered equivalent. The first group includes LT3 and LT1. The second group contains GRIT2 and GRIT3. The third group comprises the lists TCMR_RATs, ABMR_RATs, and Rej_RATs.

2.2 MDS-Biplot.

We initiate the process by performing a multidimensional scaling MDS analysis. MDS transforms the initial dissimilarity matrix into Euclidean distances while maintaining the maximum possible variation of the original dissimilarities. In this instance, we compute the first two dimensions, which account for the most relevant proportion of the original variability.

mds <- as.data.frame(cmdscale(dismatBP5, k = 2))
mds
                   V1          V2
ABMR_RATs  0.38854334  0.04392676
BAT       -0.16327259 -0.09079187
CT1       -0.26862731 -0.24698285
ENDAT     -0.16612448 -0.28420320
GRIT2      0.35204607  0.07836793
GRIT3      0.45673252  0.10300851
IRITD1    -0.22078619 -0.12783291
IRITD3    -0.13587210 -0.24607399
IRITD5     0.01069298 -0.13659312
KT1       -0.27602226 -0.27344008
LT1       -0.41625933  0.50991290
LT3       -0.41625933  0.50991290
Rej_RATs   0.43787856  0.08027021
TCMR_RATs  0.41733012  0.08051881

Below we plot the biplot:

library(ggplot2)
library(ggrepel)
graph <- ggplot() +
  geom_point(aes(mds[,1], mds[,2]), color = "blue", size = 3) +
  geom_text_repel(aes(mds[,1], mds[,2], label = attr(dismatBP5, "Labels")), 
                  color = "black", size = 3) +
    xlab("Dim 1") +
    ylab("Dim 2") +
    theme_light()
graph

The MDS-Biplot suggests that the formation of groups is consistent with what was observed in the dendrogram. It is important to remember that the distances observed between lists in the biplot are defined based on the dissimilarity threshold that establishes the equivalence of the compared lists. For instance, the proximity between the Rej_RATs and GRIT3 lists implies that both are equivalent.

However, it is more interesting to identify the GO terms most strongly associated with the dimensions generated in the biplot rather than just determining if the lists are equivalent. These GO terms elucidate the biological functions associated with the observed distances between lists, explaining the formation of groups and the biological reasons for detecting equivalences between lists. In the next section, we develop a 5-step procedure for identifying these GO terms:

3 Characterizing MDS-Biplot Dimensions.

The following 5-step procedure helps identify the GO terms that biologically elucidate the creation of groups in each dimension of the Biplot. In this specific instance, we will execute the technique on Dimension 1. However, the approach remains identical if we were to explain Dimension 2.

  • STEP 1: Split the axis (Dimension 1) in three parts to identify the lists located at the extremes.
# Split the axis 20% to the left, 60% to the middle and 20% to the right:
prop <- c(0.2, 0.6, 0.2) 
# Sort according  dimension 1:
sorted <- mds[order(mds[, 1]), ] 
# Determine the range for dimension 1.
range <- sorted[, 1][c(1, nrow(mds))] 
# Find the cutpoints to split the axis:
cutpoints <- cumsum(prop) * diff(range) + range[1]
cutpoints <- cutpoints[-length(cutpoints)]

# Identify lists to the left:  
lleft <- rownames(sorted[sorted[, 1] < cutpoints[1], ])
# Identify lists to the right
lright <- rownames(sorted[sorted[, 1] > cutpoints[2], ]) 

lleft
[1] "LT1" "LT3" "KT1" "CT1"
lright
[1] "GRIT2"     "ABMR_RATs" "TCMR_RATs" "Rej_RATs"  "GRIT3"    
graph +
  geom_vline(xintercept = cutpoints, color = "red", linetype = "dashed", 
             linewidth = 0.75)