Similarity between two groups of terms

Zuguang Gu ( z.gu@dkfz.de )

2024-02-06

The methods of group similarity implemented in simona are mainly from the supplementary file of the paper “Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017”. Original denotations have been slightly modified to make them more consistent. Also more explanations have been added in this vignette.

There are two groups of terms denoted as \(T_p\) and \(T_q\) represented as two sets:

\[ T_p = \{ a_1, a_2, ...\} \\ T_q = \{ b_1, b_2, ... \} \]

where \(a_i\) is a term in set \(T_p\) and \(b_j\) is a term in set \(T_q\).

The wrapper function group_sim() calculates semantic similarities between two groups of terms with a specific method. Note the method name can be partially matched.

group_sim(dag, group1, group2, method = ..., control = list(...))

Some of the group similarity methods have no assumption of which similarity measure between single terms to use. If there are annotation already provided in the DAG object, by default Sim_Lin_1998 is used, or else Sim_WP_1994 is used. The term similarity method can be set via the term_sim_method parameter in control. Additionally parameters for a specific term_sim_method can also be set in control.

group_sim(dag, group1, group2, method = ..., 
    control = list(term_sim_method = ...))

All supported group similarity methods are:

library(simona)
all_group_sim_methods()
##  [1] "GroupSim_pairwise_avg"            "GroupSim_pairwise_max"           
##  [3] "GroupSim_pairwise_BMA"            "GroupSim_pairwise_BMM"           
##  [5] "GroupSim_pairwise_ABM"            "GroupSim_pairwise_HDF"           
##  [7] "GroupSim_pairwise_MHDF"           "GroupSim_pairwise_VHDF"          
##  [9] "GroupSim_pairwise_Froehlich_2007" "GroupSim_pairwise_Joeng_2014"    
## [11] "GroupSim_SimALN"                  "GroupSim_SimGIC"                 
## [13] "GroupSim_SimDIC"                  "GroupSim_SimUIC"                 
## [15] "GroupSim_SimUI"                   "GroupSim_SimDB"                  
## [17] "GroupSim_SimUB"                   "GroupSim_SimNTO"                 
## [19] "GroupSim_SimCOU"                  "GroupSim_SimCOT"                 
## [21] "GroupSim_SimLP"                   "GroupSim_Ye_2005"                
## [23] "GroupSim_SimCHO"                  "GroupSim_SimALD"                 
## [25] "GroupSim_Jaccard"                 "GroupSim_Dice"                   
## [27] "GroupSim_Overlap"                 "GroupSim_Kappa"

Pairwise term similarity-based methods

GroupSim_pairwise_avg

Denote \(S(a, b)\) as the semantic similarity between term \(a\) and \(b\) where \(a\) is from group \(p\) and \(b\) is from group \(q\), The similarity between group \(p\) and group \(q\) is the average similarity of every pair of individual terms in the two groups:

\[ \mathrm{GroupSim}(p, q) = \frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q}S(a, b) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1093/bioinformatics/btg153.

GroupSim_pairwise_max

The similarity is defined as the maximal \(S(a, b)\) among all pairs of terms in group \(p\) and \(q\):

\[ \mathrm{GroupSim}(p, q) = \max_{a \in T_p, b \in T_q}S(a, b) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_max"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1109/TCBB.2005.50.

GroupSim_pairwise_BMA

BMA stands for “best-match average”. First define similarity of a term \(x\) to a group of terms \(T\) as

\[ S(x, T) = \max_{y \in T} S(x, y) \]

which corresponds to the most similar term in \(T\) to \(x\). Then the BMA similarity is calculated as:

\[ \mathrm{GroupSim}(p, q) = \frac{1}{2}\left( \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q) + \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1155/2012/975783.

GroupSim_pairwise_BMM

BMM stands for “best-match max”. It is defined as:

\[ \mathrm{GroupSim}(p, q) = \max \left \{ \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q), \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right \} \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1186/1471-2105-7-302.

GroupSim_pairwise_ABM

ABM stands for “average best-match”. It is defined as:

\[ \mathrm{GroupSim}(p, q) = \frac{1}{|T_q| + |T_q|} \left( \sum_{a \in T_p} S(a, T_q) + \sum_{b \in T_q} S(b, T_p) \right) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1186/1471-2105-14-284.

GroupSim_pairwise_HDF

First define the distance of a term \(x\) to a group of terms \(T\):

\[D(x, T) = 1 - S(x, T)\]

Then the Hausdorff distance between two groups are:

\[ \mathrm{HDF}(p, q) = \max \left\{ \max_{a \in T_p} D(a, T_q), \max_{b \in T_q} D(b, T_q) \right\} \]

This final similarity is:

\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{HDF}(p, q) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

GroupSim_pairwise_MHDF

Instead of using the maximal distance from a group to the other group, MHDF uses mean distance:

\[ \mathrm{MHDF}(p, q) = \max \left\{ \frac{1}{|T_p|} \sum_{a \in T_p} D(a, T_q), \frac{1}{|T_q|} \sum_{b \in T_q} D(b, T_q) \right\} \]

This final similarity is:

\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{MHDF}(p, q) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1109/ICPR.1994.576361.

GroupSim_pairwise_VHDF

It is defined as:

\[ \mathrm{VHDF}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} D^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} D^2(b, T_q)} \right) \]

This final similarity is:

\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{VHDF}(p, q) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1073/pnas.0702965104.

GroupSim_pairwise_Froehlich_2007

The similarity is:

\[ \mathrm{GroupSim}(p, q) = \exp(-\mathrm{HDF}(p, q)) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1186/1471-2105-8-166.

GroupSim_pairwise_Joeng_2014

Similar to VHDF, but it directly uses the similarity:

\[ \mathrm{GroupSim}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} S^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} S^2(b, T_q)} \right) \]

The term semantic similarity method and the IC method can be set via control argument, for example:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: https://doi.org/10.1109/tcbb.2014.2343963.

Pairwise edge-based methods

GroupSim_SimALN

It is based on the average distance between every pair of terms in the two groups:

\[ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} D_\mathrm{sp}(a, b)\right) \]

Or use the longest distance between two terms:

\[ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} \mathrm{len}(a, b)\right) \]

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

group_sim(dag, group1, group2, method = "GroupSim_SimALN",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.1109/CBMS.2008.27.

Groupwise IC-based methods

This category of methods depend on the IC of terms in the two groups as well as their ancestor terms.

GroupSim_SimGIC, GroupSim_SimDIC and GroupSim_SimUIC,

Denote \(A\) and \(B\) as the two sets of ancestors of terms in group \(p\) and \(q\) respectively:

\[ \begin{align*} \mathcal{A}_p &= \bigcup_{a \in T_p} \mathcal{A}_a \\ \mathcal{A}_q &= \bigcup_{b \in T_q} \mathcal{A}_b \\ \end{align*} \]

The GroupSim_SimGIC, GroupSim_SimDIC and GroupSim_SimUIC are very similar. They are based on the IC of the ancestor terms, defined as:

\[ \begin{align*} \mathrm{GroupSim}_\mathrm{SimGIC}(p, q) &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimDIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x) + \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimUIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\max\left\{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x), \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x) \right\}} \\ \end{align*} \]

IC method can be set via the control argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.

group_sim(dag, group1, group2, method = "GroupSim_SimGIC",
    control = list(IC_method = ...))

GroupSim_SimUI, GroupSim_SimDB, GroupSim_SimUB and GroupSim_SimNTO

These four methods are based on the counts of ancestor terms:

\[ \begin{align*} \mathrm{GroupSim}_\mathrm{SimUI}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p \cup \mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimDB}(p, q) &= \frac{2*|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p| + |\mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimUB}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\max\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \\ \mathrm{GroupSim}_\mathrm{SimNTO}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\min\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \end{align*} \]

group_sim(dag, group1, group2, method = "GroupSim_SimUI")

GroupSim_SimCOU

Let’s write \(\mathcal{A}_p\) and \(\mathcal{A}_q\) as two vectors \(\mathbf{v_p}\) and \(\mathbf{v_q}\). Taking \(\mathbf{v_p}\) as an example, it is \(\mathbf{v_p} = (w_1, ..., w_n)\) where \(n\) is the number of total terms in the DAG. The value \(w_i\) is assigned to the corresponding term \(t_i\) and is defined as:

\[ \mathcal{w}_{i} = \left\{ \begin{array}{ll} \mathrm{IC}(t_i) & \textrm{if} t_i \in \mathcal{A}_p \\ 0 & \textrm{otherwise} \end{array} \right. \]

The semantic similarity is defined as the cosine similarity between the two vectors:

\[ \mathrm{GroupSim}(a, b) = \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \| \cdot \left \| \mathbf{v_q} \right \|} \]

It can also be written as:

\[ \mathrm{GroupSim}(a, b) = \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sqrt{\sum\limits_{x \in \mathcal{A}_p}\mathrm{IC}(x)^2} \cdot \sqrt{\sum\limits_{x \in \mathcal{A}_q}\mathrm{IC}(x)^2}} \]

IC method can be set via the control argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.

group_sim(dag, group1, group2, method = "GroupSim_SimCOU",
    control = list(IC_method = ...))

GroupSim_SimCOT

The semantic similarity is defined as:

\[ \begin{align*} \mathrm{GroupSim}(a, b) &= \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \|^2 + \left \| \mathbf{v_q} \right \|^2 - \mathbf{v_p} \cdot \mathbf{v_q}} \\ &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q}\mathrm{IC}(x)^2} \end{align*} \]

IC method can be set via the control argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.

group_sim(dag, group1, group2, method = "GroupSim_SimCOT",
    control = list(IC_method = ...))

Groupwise edge-based methods

GroupSim_SimLP

It is the largest depth of terms in \(\mathcal{A}_p \cap \mathcal{A}_q\).

\[ \mathrm{GroupSim}(p, q) = \max\{\delta(t): t \in \mathcal{A}_p \cap \mathcal{A}_q\} \]

group_sim(dag, group1, group2, method = "GroupSim_SimLP")

Link: https://bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOvis.html#go-induced-distances.

GroupSim_Ye_2005

It is a normalized version of GroupSim_SimLP:

\[ \begin{align*} \mathrm{GroupSim}(p, q) &= \max\left\{\frac{\delta(t) - \delta_\mathrm{min}}{\delta_\mathrm{max} - \delta_\mathrm{min}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \\ &= \max\left\{\frac{\delta(t) }{\delta_\mathrm{max}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \end{align*} \]

Since the minimal depth is zero for root.

group_sim(dag, group1, group2, method = "GroupSim_Ye_2005")

Paper link: https://doi.org/10.1038/msb4100034.

Annotated items-based methods

This category of methods consider the items annotated to the two groups of terms.

GroupSim_SimCHO

It is based on the annotated items. Denote \(\sigma(t)\) as the total number of annotated items of \(t\) (after merging all its offspring terms). The similarity is calculated as:

\[ \mathrm{GroupSim}(p, q) = \frac{\log(C_{pq})}{\log(C_\mathrm{min}/C_\mathrm{max})} \]

where \(C_{pq} = \min\{\sigma(t): t \in T_p \cap T_q \}\), \(C_\mathrm{min}\) is the minimal number of annotated items in the DAG which in most cases is 1, \(C_\mathrm{max}\) is the maximal number of annotated items, which is the total number of items annotated to the complete DAG.

The similarity can also be written in form of \(\mathrm{IC}_\mathrm{anno}\):

\[ \mathrm{GroupSim}(p, q) = \frac{\max\limits_{x \in T_p \cup T_q}\mathrm{IC}(x)}{\mathrm{IC}_\mathrm{max}} \]

group_sim(dag, group1, group2, method = "GroupSim_SimCHO")

GroupSim_SimALD

The similarity is calculated as:

\[ \mathrm{GroupSim}(p, q) = \max\left\{ 1 - \frac{\sigma(x)}{C_\mathrm{max}}: x \in T_p \cap T_q \right\} \]

group_sim(dag, group1, group2, method = "GroupSim_SimALD")

Set-based methods

Since \(T_p\) and \(T_q\) are two sets, the Kappa coeffcient, Jaccard coeffcient, Dice coeffcient and overlap coeffcient can be naturally used.

group_sim(dag, group1, group2, method = "GroupSim_Jaccard", 
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Dice", 
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Overlap", 
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Kappa", 
    control = list(universe = ...))

Session info

sessionInfo()
## R version 4.3.2 Patched (2023-11-13 r85521)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] org.Hs.eg.db_3.18.0  AnnotationDbi_1.64.1 IRanges_2.36.0      
## [4] S4Vectors_0.40.2     Biobase_2.62.0       BiocGenerics_0.48.1 
## [7] igraph_2.0.1.1       simona_1.0.10        knitr_1.45          
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.42.0         circlize_0.4.15         shape_1.4.6            
##  [4] rjson_0.2.21            xfun_0.41               bslib_0.6.1            
##  [7] GlobalOptions_0.1.2     bitops_1.0-7            vctrs_0.6.5            
## [10] tools_4.3.2             curl_5.2.0              parallel_4.3.2         
## [13] Polychrome_1.5.1        RSQLite_2.3.5           highr_0.10             
## [16] cluster_2.1.6           blob_1.2.4              pkgconfig_2.0.3        
## [19] RColorBrewer_1.1-3      scatterplot3d_0.3-44    GenomeInfoDbData_1.2.11
## [22] lifecycle_1.0.4         compiler_4.3.2          textshaping_0.3.7      
## [25] Biostrings_2.70.2       codetools_0.2-19        ComplexHeatmap_2.18.0  
## [28] clue_0.3-65             GenomeInfoDb_1.38.5     httpuv_1.6.14          
## [31] htmltools_0.5.7         sass_0.4.8              RCurl_1.98-1.14        
## [34] yaml_2.3.8              later_1.3.2             crayon_1.5.2           
## [37] jquerylib_0.1.4         GO.db_3.18.0            ellipsis_0.3.2         
## [40] cachem_1.0.8            iterators_1.0.14        foreach_1.5.2          
## [43] mime_0.12               digest_0.6.34           fastmap_1.1.1          
## [46] grid_4.3.2              colorspace_2.1-0        cli_3.6.2              
## [49] magrittr_2.0.3          promises_1.2.1          bit64_4.0.5            
## [52] rmarkdown_2.25          XVector_0.42.0          httr_1.4.7             
## [55] matrixStats_1.2.0       bit_4.0.5               ragg_1.2.7             
## [58] png_0.1-8               GetoptLong_1.0.5        memoise_2.0.1          
## [61] shiny_1.8.0             evaluate_0.23           doParallel_1.0.17      
## [64] rlang_1.1.3             Rcpp_1.0.12             xtable_1.8-4           
## [67] glue_1.7.0              DBI_1.2.1               xml2_1.3.6             
## [70] jsonlite_1.8.8          R6_2.5.1                systemfonts_1.0.5      
## [73] zlibbioc_1.48.0