The methods of group similarity implemented in simona are mainly from the supplementary file of the paper “Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017”. Original denotations have been slightly modified to make them more consistent. Also more explanations have been added in this vignette.
There are two groups of terms denoted as \(T_p\) and \(T_q\) represented as two sets:
\[ T_p = \{ a_1, a_2, ...\} \\ T_q = \{ b_1, b_2, ... \} \]
where \(a_i\) is a term in set \(T_p\) and \(b_j\) is a term in set \(T_q\).
The wrapper function group_sim()
calculates semantic similarities between two groups of terms with a specific method. Note the method name can be partially matched.
group_sim(dag, group1, group2, method = ..., control = list(...))
Some of the group similarity methods have no assumption of which similarity measure between single terms to use. If there are annotation already provided in the DAG object, by default Sim_Lin_1998 is used, or else Sim_WP_1994 is used. The term similarity method can be set via the term_sim_method
parameter in control
. Additionally parameters for a specific term_sim_method
can also be set in control
.
group_sim(dag, group1, group2, method = ..., control = list(term_sim_method = ...))
All supported group similarity methods are:
## [1] "GroupSim_pairwise_avg" "GroupSim_pairwise_max"
## [3] "GroupSim_pairwise_BMA" "GroupSim_pairwise_BMM"
## [5] "GroupSim_pairwise_ABM" "GroupSim_pairwise_HDF"
## [7] "GroupSim_pairwise_MHDF" "GroupSim_pairwise_VHDF"
## [9] "GroupSim_pairwise_Froehlich_2007" "GroupSim_pairwise_Joeng_2014"
## [11] "GroupSim_SimALN" "GroupSim_SimGIC"
## [13] "GroupSim_SimDIC" "GroupSim_SimUIC"
## [15] "GroupSim_SimUI" "GroupSim_SimDB"
## [17] "GroupSim_SimUB" "GroupSim_SimNTO"
## [19] "GroupSim_SimCOU" "GroupSim_SimCOT"
## [21] "GroupSim_SimLP" "GroupSim_Ye_2005"
## [23] "GroupSim_SimCHO" "GroupSim_SimALD"
## [25] "GroupSim_Jaccard" "GroupSim_Dice"
## [27] "GroupSim_Overlap" "GroupSim_Kappa"
Denote \(S(a, b)\) as the semantic similarity between term \(a\) and \(b\) where \(a\) is from group \(p\) and \(b\) is from group \(q\), The similarity between group \(p\) and group \(q\) is the average similarity of every pair of individual terms in the two groups:
\[ \mathrm{GroupSim}(p, q) = \frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q}S(a, b) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1093/bioinformatics/btg153.
The similarity is defined as the maximal \(S(a, b)\) among all pairs of terms in group \(p\) and \(q\):
\[ \mathrm{GroupSim}(p, q) = \max_{a \in T_p, b \in T_q}S(a, b) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_max" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1109/TCBB.2005.50.
BMA stands for “best-match average”. First define similarity of a term \(x\) to a group of terms \(T\) as
\[ S(x, T) = \max_{y \in T} S(x, y) \]
which corresponds to the most similar term in \(T\) to \(x\). Then the BMA similarity is calculated as:
\[ \mathrm{GroupSim}(p, q) = \frac{1}{2}\left( \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q) + \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1155/2012/975783.
BMM stands for “best-match max”. It is defined as:
\[ \mathrm{GroupSim}(p, q) = \max \left \{ \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q), \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right \} \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1186/1471-2105-7-302.
ABM stands for “average best-match”. It is defined as:
\[ \mathrm{GroupSim}(p, q) = \frac{1}{|T_q| + |T_q|} \left( \sum_{a \in T_p} S(a, T_q) + \sum_{b \in T_q} S(b, T_p) \right) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1186/1471-2105-14-284.
First define the distance of a term \(x\) to a group of terms \(T\):
\[D(x, T) = 1 - S(x, T)\]
Then the Hausdorff distance between two groups are:
\[ \mathrm{HDF}(p, q) = \max \left\{ \max_{a \in T_p} D(a, T_q), \max_{b \in T_q} D(b, T_q) \right\} \]
This final similarity is:
\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{HDF}(p, q) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Instead of using the maximal distance from a group to the other group, MHDF uses mean distance:
\[ \mathrm{MHDF}(p, q) = \max \left\{ \frac{1}{|T_p|} \sum_{a \in T_p} D(a, T_q), \frac{1}{|T_q|} \sum_{b \in T_q} D(b, T_q) \right\} \]
This final similarity is:
\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{MHDF}(p, q) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1109/ICPR.1994.576361.
It is defined as:
\[ \mathrm{VHDF}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} D^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} D^2(b, T_q)} \right) \]
This final similarity is:
\[ \mathrm{GroupSim}(p, q) = 1 - \mathrm{VHDF}(p, q) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1073/pnas.0702965104.
The similarity is:
\[ \mathrm{GroupSim}(p, q) = \exp(-\mathrm{HDF}(p, q)) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1186/1471-2105-8-166.
Similar to VHDF, but it directly uses the similarity:
\[ \mathrm{GroupSim}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} S^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} S^2(b, T_q)} \right) \]
The term semantic similarity method and the IC method can be set via control
argument, for example:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014" control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.
Other parameters for the term_sim_method
can also be set in the control
list.
Paper link: https://doi.org/10.1109/tcbb.2014.2343963.
It is based on the average distance between every pair of terms in the two groups:
\[ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} D_\mathrm{sp}(a, b)\right) \]
Or use the longest distance between two terms:
\[ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} \mathrm{len}(a, b)\right) \]
There is a parameter distance which takes value of "longest_distances_via_LCA"
(the default) or "shortest_distances_via_NCA"
:
group_sim(dag, group1, group2, method = "GroupSim_SimALN", control = list(distance = "shortest_distances_via_NCA"))
Paper link: https://doi.org/10.1109/CBMS.2008.27.
This category of methods depend on the IC of terms in the two groups as well as their ancestor terms.
Denote \(A\) and \(B\) as the two sets of ancestors of terms in group \(p\) and \(q\) respectively:
\[ \begin{align*} \mathcal{A}_p &= \bigcup_{a \in T_p} \mathcal{A}_a \\ \mathcal{A}_q &= \bigcup_{b \in T_q} \mathcal{A}_b \\ \end{align*} \]
The GroupSim_SimGIC, GroupSim_SimDIC and GroupSim_SimUIC are very similar. They are based on the IC of the ancestor terms, defined as:
\[ \begin{align*} \mathrm{GroupSim}_\mathrm{SimGIC}(p, q) &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimDIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x) + \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x)} \\ \mathrm{GroupSim}_\mathrm{SimUIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\max\left\{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x), \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x) \right\}} \\ \end{align*} \]
IC method can be set via the control
argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.
group_sim(dag, group1, group2, method = "GroupSim_SimGIC", control = list(IC_method = ...))
These four methods are based on the counts of ancestor terms:
\[ \begin{align*} \mathrm{GroupSim}_\mathrm{SimUI}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p \cup \mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimDB}(p, q) &= \frac{2*|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p| + |\mathcal{A}_q|} \\ \mathrm{GroupSim}_\mathrm{SimUB}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\max\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \\ \mathrm{GroupSim}_\mathrm{SimNTO}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\min\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \end{align*} \]
group_sim(dag, group1, group2, method = "GroupSim_SimUI")
Let’s write \(\mathcal{A}_p\) and \(\mathcal{A}_q\) as two vectors \(\mathbf{v_p}\) and \(\mathbf{v_q}\). Taking \(\mathbf{v_p}\) as an example, it is \(\mathbf{v_p} = (w_1, ..., w_n)\) where \(n\) is the number of total terms in the DAG. The value \(w_i\) is assigned to the corresponding term \(t_i\) and is defined as:
\[ \mathcal{w}_{i} = \left\{ \begin{array}{ll} \mathrm{IC}(t_i) & \textrm{if} t_i \in \mathcal{A}_p \\ 0 & \textrm{otherwise} \end{array} \right. \]
The semantic similarity is defined as the cosine similarity between the two vectors:
\[ \mathrm{GroupSim}(a, b) = \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \| \cdot \left \| \mathbf{v_q} \right \|} \]
It can also be written as:
\[ \mathrm{GroupSim}(a, b) = \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sqrt{\sum\limits_{x \in \mathcal{A}_p}\mathrm{IC}(x)^2} \cdot \sqrt{\sum\limits_{x \in \mathcal{A}_q}\mathrm{IC}(x)^2}} \]
IC method can be set via the control
argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.
group_sim(dag, group1, group2, method = "GroupSim_SimCOU", control = list(IC_method = ...))
The semantic similarity is defined as:
\[ \begin{align*} \mathrm{GroupSim}(a, b) &= \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \|^2 + \left \| \mathbf{v_q} \right \|^2 - \mathbf{v_p} \cdot \mathbf{v_q}} \\ &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q}\mathrm{IC}(x)^2} \end{align*} \]
IC method can be set via the control
argument. By default if there is annotation associated, IC_annotation is used, or else IC_offspring is used.
group_sim(dag, group1, group2, method = "GroupSim_SimCOT", control = list(IC_method = ...))
It is the largest depth of terms in \(\mathcal{A}_p \cap \mathcal{A}_q\).
\[ \mathrm{GroupSim}(p, q) = \max\{\delta(t): t \in \mathcal{A}_p \cap \mathcal{A}_q\} \]
group_sim(dag, group1, group2, method = "GroupSim_SimLP")
It is a normalized version of GroupSim_SimLP:
\[ \begin{align*} \mathrm{GroupSim}(p, q) &= \max\left\{\frac{\delta(t) - \delta_\mathrm{min}}{\delta_\mathrm{max} - \delta_\mathrm{min}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \\ &= \max\left\{\frac{\delta(t) }{\delta_\mathrm{max}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \end{align*} \]
Since the minimal depth is zero for root.
group_sim(dag, group1, group2, method = "GroupSim_Ye_2005")
Paper link: https://doi.org/10.1038/msb4100034.
This category of methods consider the items annotated to the two groups of terms.
It is based on the annotated items. Denote \(\sigma(t)\) as the total number of annotated items of \(t\) (after merging all its offspring terms). The similarity is calculated as:
\[ \mathrm{GroupSim}(p, q) = \frac{\log(C_{pq})}{\log(C_\mathrm{min}/C_\mathrm{max})} \]
where \(C_{pq} = \min\{\sigma(t): t \in T_p \cap T_q \}\), \(C_\mathrm{min}\) is the minimal number of annotated items in the DAG which in most cases is 1, \(C_\mathrm{max}\) is the maximal number of annotated items, which is the total number of items annotated to the complete DAG.
The similarity can also be written in form of \(\mathrm{IC}_\mathrm{anno}\):
\[ \mathrm{GroupSim}(p, q) = \frac{\max\limits_{x \in T_p \cup T_q}\mathrm{IC}(x)}{\mathrm{IC}_\mathrm{max}} \]
group_sim(dag, group1, group2, method = "GroupSim_SimCHO")
The similarity is calculated as:
\[ \mathrm{GroupSim}(p, q) = \max\left\{ 1 - \frac{\sigma(x)}{C_\mathrm{max}}: x \in T_p \cap T_q \right\} \]
group_sim(dag, group1, group2, method = "GroupSim_SimALD")
Since \(T_p\) and \(T_q\) are two sets, the Kappa coeffcient, Jaccard coeffcient, Dice coeffcient and overlap coeffcient can be naturally used.
group_sim(dag, group1, group2, method = "GroupSim_Jaccard", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Dice", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Overlap", control = list(universe = ...)) group_sim(dag, group1, group2, method = "GroupSim_Kappa", control = list(universe = ...))
## R version 4.3.2 Patched (2023-11-13 r85521)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
## [4] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
## [7] igraph_2.0.1.1 simona_1.0.10 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.42.0 circlize_0.4.15 shape_1.4.6
## [4] rjson_0.2.21 xfun_0.41 bslib_0.6.1
## [7] GlobalOptions_0.1.2 bitops_1.0-7 vctrs_0.6.5
## [10] tools_4.3.2 curl_5.2.0 parallel_4.3.2
## [13] Polychrome_1.5.1 RSQLite_2.3.5 highr_0.10
## [16] cluster_2.1.6 blob_1.2.4 pkgconfig_2.0.3
## [19] RColorBrewer_1.1-3 scatterplot3d_0.3-44 GenomeInfoDbData_1.2.11
## [22] lifecycle_1.0.4 compiler_4.3.2 textshaping_0.3.7
## [25] Biostrings_2.70.2 codetools_0.2-19 ComplexHeatmap_2.18.0
## [28] clue_0.3-65 GenomeInfoDb_1.38.5 httpuv_1.6.14
## [31] htmltools_0.5.7 sass_0.4.8 RCurl_1.98-1.14
## [34] yaml_2.3.8 later_1.3.2 crayon_1.5.2
## [37] jquerylib_0.1.4 GO.db_3.18.0 ellipsis_0.3.2
## [40] cachem_1.0.8 iterators_1.0.14 foreach_1.5.2
## [43] mime_0.12 digest_0.6.34 fastmap_1.1.1
## [46] grid_4.3.2 colorspace_2.1-0 cli_3.6.2
## [49] magrittr_2.0.3 promises_1.2.1 bit64_4.0.5
## [52] rmarkdown_2.25 XVector_0.42.0 httr_1.4.7
## [55] matrixStats_1.2.0 bit_4.0.5 ragg_1.2.7
## [58] png_0.1-8 GetoptLong_1.0.5 memoise_2.0.1
## [61] shiny_1.8.0 evaluate_0.23 doParallel_1.0.17
## [64] rlang_1.1.3 Rcpp_1.0.12 xtable_1.8-4
## [67] glue_1.7.0 DBI_1.2.1 xml2_1.3.6
## [70] jsonlite_1.8.8 R6_2.5.1 systemfonts_1.0.5
## [73] zlibbioc_1.48.0