1 Introduction

Methods to find similarities have been developed for several purposes, being Jaccard and Dice similarities the most known. In bioinformatics much of the research on the topic is centered around Gene Ontologies because they provide controlled vocabularies, as part of their mission:

The mission of the GO Consortium is to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular and organism-level systems.

However, there is another resource of similarities between genes: metabolic pathways. Metabolic pathways describe the relationship between genes, proteins, lipids and other elements of the cells. A pathway describes, to some extent, the function in which it is involved in the cell. There exists several databases about which gene belong to which pathway. Together with pathways, gene sets related to a function or to a phenotype are a source of information of the genes function. With this package we provide the methods to calculate functional similarities based on this information.

Here we provides functions to calculate functional similarities distances for pathways, gene sets, genes and clusters of genes. The name BioCor stands from biological correlation, shortened to BioCor, because as said we look if some genes are in the same pathways or gene sets as other genes.

BioCor is different from GeneOverlap because here we use the Dice index instead of the Jaccard index (although we provide a function to change from one to the other, see this section)and that package only allows to compare pathways but not genes or groups of genes. But GeneOverlap provides some functionalities to plot the similarity scores and provides the associated p-value to the comparison of pathways.

The development of this package aimed initially to improve clustering of genes by functionality in weighted gene co-expression networks using WGCNA. The package has some functions to combine similarities in order to integrate with WGCNA. For other uses you can check the advanced vignette..

2 Citation

You can cite the package as:

citation("BioCor")

3 Installation

The BioCor package is available at Bioconductor and can be downloaded and installed via BiocManager:

install.packages("BiocManager")
BiocManager::install("BioCor")

You can install the latest version of BioCor from Github with:

library("devtools")
install_github("llrs/BioCor")

4 Using BioCor

4.1 Preparation

We can load the package and prepare the data for which we want to calculate the similarities:

library("BioCor")
## Load libraries with the data of the pathways
library("org.Hs.eg.db")
library("reactome.db")
genesKegg <- as.list(org.Hs.egPATH)
genesReact <- as.list(reactomeEXTID2PATHID)
# Remove genes and pathways which are not from human pathways 
genesReact <- lapply(genesReact, function(x){
    unique(grep("R-HSA-", x, value = TRUE))
    })
genesReact <- genesReact[lengths(genesReact) >= 1] 

To avoid having biased data it is important to have all the data about the pathways and genes associated to all pathways for the organism under study. Here we assume that we are interested in human pathways. We use this two databases KEGG and Reactome as they are easy to obtain the data. However KEGG database is no longer free for large retrievals therefore it is not longer updated in the Bioconductor annotation packages.

However, one can use any list where the names of the list are the genes and the elements of the list the pathways or groups where the gene belong. One could also read from a GMT file or use GeneSetCollections in addition or instead of those associations from a pathway database and convert it to list using:

library("GSEABase")
paths2Genes <- geneIds(getGmt("/path/to/file.symbol.gmt",
                 geneIdType=SymbolIdentifier()))

genes <- unlist(paths2Genes, use.names = FALSE)
pathways <- rep(names(paths2Genes), lengths(paths2Genes))
genes2paths <- split(pathways, genes) # List of genes and the gene sets

With genes2paths we have the information ready to use.

4.2 Pathway similarities

We can compute similarities (Dice similarity, see question 1 of FAQ) between two pathways or between several pathways and combine them, or not:

(paths <- sample(unique(unlist(genesReact)), 2))
## [1] "R-HSA-8963693" "R-HSA-418594"
pathSim(paths[1], paths[2], genesReact)
## [1] 0

(pathways <- sample(unique(unlist(genesReact)), 10))
##  [1] "R-HSA-9673163" "R-HSA-2682334" "R-HSA-5619110" "R-HSA-9706374"
##  [5] "R-HSA-9627069" "R-HSA-937042"  "R-HSA-8953897" "R-HSA-5250913"
##  [9] "R-HSA-5250941" "R-HSA-9636667"
mpathSim(pathways, genesReact)
##               R-HSA-9673163 R-HSA-2682334 R-HSA-5619110 R-HSA-9706374
## R-HSA-9673163             1    0.00000000             0    0.00000000
## R-HSA-2682334             0    1.00000000             0    0.02040816
## R-HSA-5619110             0    0.00000000             1    0.00000000
## R-HSA-9706374             0    0.02040816             0    1.00000000
## R-HSA-9627069             0    0.00000000             0    0.00000000
## R-HSA-937042              0    0.00000000             0    0.00000000
## R-HSA-8953897             0    0.00000000             0    0.00000000
## R-HSA-5250913             0    0.01010101             0    0.00000000
## R-HSA-5250941             0    0.00000000             0    0.00000000
## R-HSA-9636667             0    0.00000000             0    0.00000000
##               R-HSA-9627069 R-HSA-937042 R-HSA-8953897 R-HSA-5250913
## R-HSA-9673163   0.000000000   0.00000000   0.000000000    0.00000000
## R-HSA-2682334   0.000000000   0.00000000   0.000000000    0.01010101
## R-HSA-5619110   0.000000000   0.00000000   0.000000000    0.00000000
## R-HSA-9706374   0.000000000   0.00000000   0.000000000    0.00000000
## R-HSA-9627069   1.000000000   0.00000000   0.007255139    0.00000000
## R-HSA-937042    0.000000000   1.00000000   0.009685230    0.00000000
## R-HSA-8953897   0.007255139   0.00968523   1.000000000    0.13882863
## R-HSA-5250913   0.000000000   0.00000000   0.138828633    1.00000000
## R-HSA-5250941   0.000000000   0.00000000   0.133764833    0.75576037
## R-HSA-9636667   0.000000000   0.00000000   0.000000000    0.00000000
##               R-HSA-5250941 R-HSA-9636667
## R-HSA-9673163     0.0000000             0
## R-HSA-2682334     0.0000000             0
## R-HSA-5619110     0.0000000             0
## R-HSA-9706374     0.0000000             0
## R-HSA-9627069     0.0000000             0
## R-HSA-937042      0.0000000             0
## R-HSA-8953897     0.1337648             0
## R-HSA-5250913     0.7557604             0
## R-HSA-5250941     1.0000000             0
## R-HSA-9636667     0.0000000             1

When the method to combine the similarities is set to NULL mpathSim() returns a matrix of pathway similarities, otherwise it combines the values. In the next section we can see the methods to combine pathway similarities.

4.2.1 Combining values

To combine values we provide a function with several methods:

sim <- mpathSim(pathways, genesReact)
methodsCombineScores <- c("avg", "max", "rcmax", "rcmax.avg", "BMA",
                          "reciprocal")
sapply(methodsCombineScores, BioCor::combineScores, scores = sim)
##        avg        max      rcmax  rcmax.avg        BMA reciprocal 
##  0.1215161  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000

We can also specify the method to combine the similarities in mpathSim(), geneSim(), mgeneSim(), clusterSim(), mclusterSim(), clusterGeneSim() and mclusterGeneSim(), argument method. By default the method is set to “max” to combine pathways (except in mpathSim where the default is to show all the pathway similarities) and “BMA” to combine similarities of genes or for cluster analysis. This function is adapted from GOSemSim package.

The function combineScoresPar() allows to use a parallel background (using BiocParallel) to combine the scores. It is recommended to use a parallel background if you calculate more than 300 gene similarities. It also have an argument in case you want to calculate the similarity scores of several sets.

4.3 Gene similarities

To compare the function of two genes there is the geneSim function and mgeneSim function for several comparisons. In this example we compare the genes BRCA1 and BRCA2 and NAT2, which are the genes 672, 675 and 10 respectively in ENTREZID:

geneSim("672", "675", genesKegg)
## [1] 0.0824295
geneSim("672", "675", genesReact)
## [1] 1

mgeneSim(c("BRCA1" = "672", "BRCA2" = "675", "NAT2" = "10"), genesKegg)
##           BRCA1       BRCA2        NAT2
## BRCA1 1.0000000 0.082429501 0.000000000
## BRCA2 0.0824295 1.000000000 0.008241758
## NAT2  0.0000000 0.008241758 1.000000000
mgeneSim(c("BRCA1" = "672", "BRCA2" = "675", "NAT2" = "10"), genesReact)
##           BRCA1     BRCA2      NAT2
## BRCA1 1.0000000 1.0000000 0.2225994
## BRCA2 1.0000000 1.0000000 0.2225994
## NAT2  0.2225994 0.2225994 1.0000000

Note that for the same genes each database or list provided has different annotations, which result on different similarity scores. In this example BRCA1 has 3 and 27 pathways in KEGG and Reactome respectively and BRCA2 has 1 and 55 pathways in KEGG and Reactome respectively which results on different scores.

4.4 Gene cluster similarities

There are two methods:

  • Combining all the pathways for each cluster and compare between them.
  • Calculate the similarity between genes of a cluster and the other cluster.

4.4.1 By pathways

As explained, in this method all the pathways of a cluster are compared with all the pathways of the other cluster. If a method to combine pathways similarities is not provided, all pathway similarities are returned:

clusterSim(c("672", "675"), c("100", "10", "1"), genesKegg)
## [1] 0.04210526
clusterSim(c("672", "675"), c("100", "10", "1"), genesKegg, NULL)
##            00230       01100       05340 00232 00983
## 04120 0.00000000 0.000000000 0.011764706     0     0
## 03440 0.04210526 0.006908463 0.000000000     0     0
## 05200 0.00000000 0.008241758 0.005540166     0     0
## 05212 0.00000000 0.001666667 0.019047619     0     0

clusters <- list(cluster1 = c("672", "675"),
                 cluster2 = c("100", "10", "1"),
                 cluster3 = c("18", "10", "83"))
mclusterSim(clusters, genesKegg, "rcmax.avg")
##            cluster1   cluster2   cluster3
## cluster1 1.00000000 0.01587957 0.00256157
## cluster2 0.01587957 1.00000000 0.56412591
## cluster3 0.00256157 0.56412591 1.00000000
mclusterSim(clusters, genesKegg, "max")
##             cluster1   cluster2    cluster3
## cluster1 1.000000000 0.04210526 0.008241758
## cluster2 0.042105263 1.00000000 1.000000000
## cluster3 0.008241758 1.00000000 1.000000000

4.4.2 By genes

In this method first the similarities between each gene is calculated, then the similarity between each group of genes is calculated. Requiring two methods to combine values, the first one to combine pathways similarities and the second one to combine genes similarities. If only one is provided it returns the matrix of similarities of the genes of each cluster:

clusterGeneSim(c("672", "675"), c("100", "10", "1"), genesKegg)
## [1] 0.02605425
clusterGeneSim(c("672", "675"), c("100", "10", "1"), genesKegg, "max")
##            100          10  1
## 672 0.01176471 0.000000000 NA
## 675 0.04210526 0.008241758 NA

mclusterGeneSim(clusters, genesKegg, c("max", "rcmax.avg"))
##             cluster1   cluster2    cluster3
## cluster1 1.000000000 0.02605425 0.006181319
## cluster2 0.026054248 1.00000000 1.000000000
## cluster3 0.006181319 1.00000000 1.000000000
mclusterGeneSim(clusters, genesKegg, c("max", "max"))
##             cluster1   cluster2    cluster3
## cluster1 1.000000000 0.04210526 0.008241758
## cluster2 0.042105263 1.00000000 1.000000000
## cluster3 0.008241758 1.00000000 1.000000000

Note the differences between mclusterGeneSim() and mclusterSim() in the similarity values of the clusters. If we set method = c("max", "max") in mclusterGeneSim() then the similarity between the clusters is the same as clusterSim().

4.5 Converting similarities

If needed, Jaccard similarity can be calculated from Dice similarity using D2J():

D2J(sim)
##               R-HSA-9673163 R-HSA-2682334 R-HSA-5619110 R-HSA-9706374
## R-HSA-9673163             1   0.000000000             0    0.00000000
## R-HSA-2682334             0   1.000000000             0    0.01030928
## R-HSA-5619110             0   0.000000000             1    0.00000000
## R-HSA-9706374             0   0.010309278             0    1.00000000
## R-HSA-9627069             0   0.000000000             0    0.00000000
## R-HSA-937042              0   0.000000000             0    0.00000000
## R-HSA-8953897             0   0.000000000             0    0.00000000
## R-HSA-5250913             0   0.005076142             0    0.00000000
## R-HSA-5250941             0   0.000000000             0    0.00000000
## R-HSA-9636667             0   0.000000000             0    0.00000000
##               R-HSA-9627069 R-HSA-937042 R-HSA-8953897 R-HSA-5250913
## R-HSA-9673163   0.000000000   0.00000000   0.000000000   0.000000000
## R-HSA-2682334   0.000000000   0.00000000   0.000000000   0.005076142
## R-HSA-5619110   0.000000000   0.00000000   0.000000000   0.000000000
## R-HSA-9706374   0.000000000   0.00000000   0.000000000   0.000000000
## R-HSA-9627069   1.000000000   0.00000000   0.003640777   0.000000000
## R-HSA-937042    0.000000000   1.00000000   0.004866180   0.000000000
## R-HSA-8953897   0.003640777   0.00486618   1.000000000   0.074592075
## R-HSA-5250913   0.000000000   0.00000000   0.074592075   1.000000000
## R-HSA-5250941   0.000000000   0.00000000   0.071676301   0.607407407
## R-HSA-9636667   0.000000000   0.00000000   0.000000000   0.000000000
##               R-HSA-5250941 R-HSA-9636667
## R-HSA-9673163     0.0000000             0
## R-HSA-2682334     0.0000000             0
## R-HSA-5619110     0.0000000             0
## R-HSA-9706374     0.0000000             0
## R-HSA-9627069     0.0000000             0
## R-HSA-937042      0.0000000             0
## R-HSA-8953897     0.0716763             0
## R-HSA-5250913     0.6074074             0
## R-HSA-5250941     1.0000000             0
## R-HSA-9636667     0.0000000             1

Also if one has a Jaccard similarity and wants a Dice similarity, can use the J2D() function.

5 High volumes of gene similarities

We can compute the whole similarity of genes in KEGG or Reactome by using :

## Omit those genes without a pathway
nas <- sapply(genesKegg, function(y){all(is.na(y)) | is.null(y)})
genesKegg2 <- genesKegg[!nas]
m <- mgeneSim(names(genesKegg2), genesKegg2, method  = "max")

It takes around 5 hours in one core but it requires high memory available.

If one doesn’t have such a memory available can compute the similarities by pieces, and then fit it in another matrix with:

sim <- AintoB(m, B)

Usually B is a matrix of size length(genes), see ?AintoB().

6 An example of usage

In this example I show how to use BioCor to analyse a list of genes by functionality. With a list of genes we are going to see how similar are those genes:

genes.id <- c("10", "15", "16", "18", "2", "9", "52", "3855", "3880", "644", 
              "81327", "9128", "2073", "2893", "5142", "60", "210", "81", 
              "1352", "88", "672", "675")
genes.id <- mapIds(org.Hs.eg.db, keys = genes.id, keytype = "ENTREZID", 
                   column = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
genes <- names(genes.id)
names(genes) <- genes.id
react <- mgeneSim(genes, genesReact)
## Warning in mgeneSim(genes, genesReact): Some genes are not in the list
## provided.
## We remove genes which are not in list (hence the warning):
nan <- genes %in% names(genesReact)
react <- react[nan, nan]
hc <- hclust(as.dist(1 - react))
plot(hc, main = "Similarities between genes")
Dendrogram of the similarities of genes according to Reactome.

Figure 1: Gene clustering by similarities

Now we can see the relationship between the genes. We can group them for a cluster analysis to visualize the relationship between the clusters:

mycl <- cutree(hc, h = 0.2)
clusters <- split(genes[nan], as.factor(mycl))
# Removing clusters of just one gene
(clusters <- clusters[lengths(clusters) >= 2])
## $`1`
##   NAT2  AANAT   NAT1  BLVRA   ALAD  COX10 
##   "10"   "15"    "9"  "644"  "210" "1352" 
## 
## $`2`
## AARS1  ACTB BRCA1 
##  "16"  "60" "672" 
## 
## $`3`
##   ABAT  GRIA4  ACTN2 
##   "18" "2893"   "88" 
## 
## $`4`
##   A2M ACTN4 
##   "2"  "81" 
## 
## $`5`
##   KRT7  KRT19 
## "3855" "3880" 
## 
## $`8`
##  ERCC5  BRCA2 
## "2073"  "675"
names(clusters) <- paste0("cluster", names(clusters))
## Remember we can use two methods to compare clusters
sim_clus1 <- mclusterSim(clusters, genesReact)
plot(hclust(as.dist(1 - sim_clus1)), 
     main = "Similarities between clusters by pathways")
Dendrogram of clusters of genes according to Reactome.

Figure 2: Clustering using clusterSim

sim_clus2 <- mclusterGeneSim(clusters, genesReact)
plot(hclust(as.dist(1 - sim_clus2)), 
     main ="Similarities between clusters by genes")
Dendrogram of clusters according to similarities between genes from Reactome pathways.

Figure 3: Clustering using clusterGeneSim