1 Introduction

BiocSet is a package that represents element sets in a tibble format with the BiocSet class. Element sets are read in and converted into a tibble format. From here, typical dplyr operations can be performed on the element set. BiocSet also provides functionality for mapping different ID types and providing reference urls for elements and sets.

2 Installation

Install the most recent version from Bioconductor:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocSet")

The development version is also available for install from GitHub:

BiocManager::install("Kayla-Morrell/BiocSet")

Then load BiocSet:

library(BiocSet)

3 BiocSet

3.1 Input and Output

BiocSet can create a BiocSet object using two different input methods. The first is to input named character vectors of element sets. BiocSet returns three tibbles, es_element which contains the elements, es_set which contains the sets and es_elementset which contains elements and sets together.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows

The second method of creating a BiocSet object would be to read in a .gmt file. Using import(), a path to a downloaded .gmt file is read in and a BiocSet object is returned. The example below uses a hallmark element set downloaded from GSEA, which is also included with this package. This BiocSet includes a source column within the es_elementset tibble for reference as to where the element set came from.

gmtFile <- system.file(package = "BiocSet",
                        "extdata",
                        "hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 4,386 × 1
#>   element
#>   <chr>  
#> 1 JUNB   
#> 2 CXCL2  
#> 3 ATF3   
#> # … with 4,383 more rows
#> 
#> es_set():
#> # A tibble: 50 × 2
#>   set                              source                                       
#>   <chr>                            <chr>                                        
#> 1 HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.broadinstitute.org/gsea/msigdb/ca…
#> 2 HALLMARK_HYPOXIA                 http://www.broadinstitute.org/gsea/msigdb/ca…
#> 3 HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.broadinstitute.org/gsea/msigdb/ca…
#> # … with 47 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 7,324 × 2
#>   element set                             
#>   <chr>   <chr>                           
#> 1 JUNB    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 2 CXCL2   HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 3 ATF3    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> # … with 7,321 more rows

export() allows for a BiocSet object to be exported into a temporary file with the extention .gmt.

fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt
#> GMTFile object
#> resource: /tmp/RtmpM1Rp8a/fileb2bd1351c0896.gmt

We also support importing files to create an OBOSet object which extends the BiocSet class. The default only displays the most basic amount of columns needed to perform certain tasks, but the argument extract_tag = everything allows for additional columns to be imported as well. The following workflow will demonstrate the use of both export and import to create a smaller subset of the large obo file that is accessable from GeneOntology.

download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo")

foo <- import("obo_file.obo", extract_tag = "everything")

small_tst <- es_element(foo)[1,] %>%
    unnest("ancestors") %>%
    select("element", "ancestors") %>%
    unlist() %>%
    unique()

small_oboset <- foo %>% filter_elementset(element %in% small_tst)

fl <- tempfile(fileext = ".obo")
export(small_oboset, fl)

Then you can import this smaller obo file which we utilize here, in tests, and example code.

oboFile <- system.file(package = "BiocSet",
                        "extdata",
                        "sample_go.obo")
obo <- import(oboFile)
obo
#> class: OBOSet
#> 
#> es_element():
#> # A tibble: 8 × 3
#>   element    name                             obsolete
#>   <chr>      <chr>                            <lgl>   
#> 1 GO:0033955 mitochondrial DNA inheritance    FALSE   
#> 2 GO:0000002 mitochondrial genome maintenance FALSE   
#> 3 GO:0007005 mitochondrion organization       FALSE   
#> # … with 5 more rows
#> 
#> es_set():
#> # A tibble: 8 × 2
#>   set        name                            
#>   <chr>      <chr>                           
#> 1 GO:0000002 mitochondrial genome maintenance
#> 2 GO:0006996 organelle organization          
#> 3 GO:0007005 mitochondrion organization      
#> # … with 5 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 36 × 3
#>   element    set        is_a 
#>   <chr>      <chr>      <lgl>
#> 1 GO:0033955 GO:0000002 TRUE 
#> 2 GO:0000002 GO:0000002 FALSE
#> 3 GO:0007005 GO:0006996 TRUE 
#> # … with 33 more rows

3.2 Implemented functions

A feature available to BiocSet is the ability to activate different tibbles to perform certain functions on. When a BiocSet is created, the tibble es_elementset is automatically activated and all functions will be performed on this tibble. BiocSet adopts the use of many common dplyr functions such as filter(), select(), mutate(), summarise(), and arrange(). With each of the functions the user is able to pick a different tibble to activate and work on by using ‘verb_tibble’. After the function is executed than the ‘active’ tibble is returned back to the tibble that was active before the function call. Some examples are shown below of how these functions work.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows
tbl %>% filter_element(element == "a" | element == "A")
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 2 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 A      
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 2 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 A       set2
tbl %>% mutate_set(pval = rnorm(1:2))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 2
#>   set     pval
#>   <chr>  <dbl>
#> 1 set1  -0.741
#> 2 set2   1.20 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows
tbl %>% arrange_elementset(desc(element))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 z       set1 
#> 2 y       set1 
#> 3 x       set1 
#> # … with 49 more rows

3.3 Set operations

BiocSet also allows for common set operations to be performed on the BiocSet object. union() and intersection() are the two set operations available in BiocSet. We demonstate how a user can find the union between two BiocSet objects or within a single BiocSet object. Intersection is used in the same way.

# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 8 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 5 more rows
#> 
#> es_set():
#> # A tibble: 2 × 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 8 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 5 more rows

# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 20 × 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 17 more rows
#> 
#> es_set():
#> # A tibble: 1 × 1
#>   set  
#>   <chr>
#> 1 union
#> 
#> es_elementset() <active>:
#> # A tibble: 20 × 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       union
#> 2 b       union
#> 3 c       union
#> # … with 17 more rows

4 Case study I

Next, we demonstrate the a couple uses of BiocSet with an experiment dataset airway from the package airway. This data is from an RNA-Seq experiment on airway smooth muscle (ASM) cell lines.

The first step is to load the library and the necessary data.

library(airway)
data("airway")
se <- airway

The function go_sets() discovers the keys from the org object and uses AnnotationDbi::select to create a mapping to GO ids. go_sets() also allows the user to indicate which evidence type or ontology type they would like when selecting the GO ids. The default is using all evidence types and all ontology types. We represent these identifieres as a BiocSet object. Using the go_sets function we are able to map the Ensembl ids and GO ids from the genome wide annotation for Human data in the org.Hs.eg.db package. The Ensembl ids are treated as elements while the GO ids are treated as sets.

library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,067 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows

# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 17,796 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000168685
#> 3 ENSG00000114030
#> # … with 17,793 more rows
#> 
#> es_set():
#> # A tibble: 4,306 × 2
#>   set        evidence    
#>   <chr>      <named list>
#> 1 GO:0000002 <chr [1]>   
#> 2 GO:0000018 <chr [1]>   
#> 3 GO:0000019 <chr [1]>   
#> # … with 4,303 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 56,928 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000168685 GO:0000018
#> 3 ENSG00000114030 GO:0000018
#> # … with 56,925 more rows

Some users may not be interested in reporting the non-descriptive elements. We demonstrate subsetting the airway data to include non-zero assays and then filter out the non-descriptive elements.

se1 = se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 16,633 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 16,630 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 252,331 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 252,328 more rows

It may also be of interest to users to know how many elements are in each set. Using the count function we are able to calculate the elements per set.

go %>% group_by(set) %>% dplyr::count()
#> Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#> Please use the `.add` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> # A tibble: 18,593 × 2
#> # Groups:   set [18,593]
#>    set            n
#>    <chr>      <int>
#>  1 GO:0000002    12
#>  2 GO:0000003     4
#>  3 GO:0000009     2
#>  4 GO:0000010     2
#>  5 GO:0000012    10
#>  6 GO:0000014     9
#>  7 GO:0000015     4
#>  8 GO:0000016     1
#>  9 GO:0000017     2
#> 10 GO:0000018     6
#> # … with 18,583 more rows

It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.

drop <- es_activate(go, elementset) %>% group_by(set) %>%
    dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,067 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows

To simplify mapping we created a couple map functions. map_unique() is used when there is known 1:1 mapping, This takes four arguements, a BiocSet object, an AnnotationDbi object, the id to map from, and the id to map to. map_multiple() needs a fifth argument to indicate how the function should treat an element when there is multiple mapping. Both functions utilize mapIds from AnnotationDbi and return a BiocSet object. In the example below we show how to use map_unique to map go’s ids from Ensembl to gene symbols.

go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
#> 'select()' returned 1:many mapping between keys and columns
#> Joining, by = "element"
#> `mutate_if()` ignored the following grouping variables:
#> Column `element`
#> Joining, by = "element"
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 20,316 × 1
#>   element
#>   <chr>  
#> 1 AKT3   
#> 2 LONP1  
#> 3 MEF2A  
#> # … with 20,313 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000009
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 285,761 × 2
#>   element set       
#>   <chr>   <chr>     
#> 1 AKT3    GO:0000002
#> 2 LONP1   GO:0000002
#> 3 MEF2A   GO:0000002
#> # … with 285,758 more rows

Another functionality of BiocSet is the ability to add information to the tibbles. Using the GO.db library we are able to map definitions to the GO ids. From there we can add the mapping to the tibble using map_add() and the mutate function.

library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,067 × 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 23,064 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 2
#>   set        definition                                                         
#>   <chr>      <chr>                                                              
#> 1 GO:0000002 The maintenance of the structure and integrity of the mitochondria…
#> 2 GO:0000003 The production of new individuals that contain some portion of gen…
#> 3 GO:0000009 Catalysis of the transfer of a mannose residue to an oligosacchari…
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows

The library KEGGREST is a client interface to the KEGG REST server. KEGG contains pathway maps that represent interaction, reaction and relation networks for various biological processes and diseases. BiocSet has a function that utilizes KEGGREST to develop a BiocSet object that contains the elements for every pathway map in KEGG.

Due to limiations of the KEGGREST package, kegg_sets can take some time to run depending on the amount of pathways for the species of interest. Therefore we demonstrate using BiocFileCache to make the data available to the user.

library(BiocFileCache)
rname <- "kegg_hsa"
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
    kegg <- kegg_sets("hsa")
    fl <- bfcnew(rname = rname, ext = ".gmt")
    export(kegg_sets("hsa"), fl)
}
kegg <- import(bfcrpath(rname=rname))

Within the kegg_sets() function we remove pathways that do not contain any elements. We then mutate the element tibble using the map_add function to contain both Ensembl and Entrez ids.

map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
kegg <- kegg %>% mutate_element(ensembl = map)

Since we are working with ASM data we thought we would subset the airway data to contain only the elements in the asthma pathway. This filter is performed on the KEGG id, which for asthma is “hsa05310”.

asthma <- kegg %>% filter_set(set == "hsa05310")

se <- se[rownames(se) %in% es_element(asthma)$ensembl,]

se
#> class: RangedSummarizedExperiment 
#> dim: 7843 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(7843): ENSG00000000419 ENSG00000000938 ... ENSG00000273079
#>   ENSG00000273085
#> rowData names(0):
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample

The filtering can also be done for multiple pathways.

pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)

multipaths
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 8,092 × 2
#>   element ensembl        
#>   <chr>   <chr>          
#> 1 3101    ENSG00000160883
#> 2 3098    ENSG00000156515
#> 3 3099    ENSG00000159399
#> # … with 8,089 more rows
#> 
#> es_set():
#> # A tibble: 4 × 2
#>   set      source
#>   <chr>    <chr> 
#> 1 hsa04110 <NA>  
#> 2 hsa04970 <NA>  
#> 3 hsa05224 <NA>  
#> # … with 1 more row
#> 
#> es_elementset() <active>:
#> # A tibble: 395 × 2
#>   element set     
#>   <chr>   <chr>   
#> 1 595     hsa04110
#> 2 894     hsa04110
#> 3 896     hsa04110
#> # … with 392 more rows

5 Case study II

This example will start out the same way that Case Study I started, by loading in the airway dataset, but we will also do some reformating of the data. The end goal is to be able to perform a Gene Set Enrichment Analysis on the data and return a BiocSet object of the gene sets.

data("airway")
airway$dex <- relevel(airway$dex, "untrt")

Similar to other analyses we perform a differential expression analysis on the airway data using the library DESeq2 and then store the results into a tibble.

library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
res <- results(des)

tbl <- res %>% 
    as.data.frame() %>%
    as_tibble(rownames = "ENSEMBL") 

Since we want to use limma::goana() to perform the GSEA we will need to have ENTREZ identifiers in the data, as well as filter out some NA information. Later on this will be considered our ‘element’ tibble.

tbl <- tbl %>% 
    mutate(
        ENTREZID = mapIds(
            org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
        ) %>% unname()
    )
#> 'select()' returned 1:many mapping between keys and columns

tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl
#> # A tibble: 15,115 × 8
#>    ENSEMBL      baseMean log2FoldChange  lfcSE   stat   pvalue     padj ENTREZID
#>    <chr>           <dbl>          <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <chr>   
#>  1 ENSG0000000…    709.         -0.381  0.101  -3.79   1.52e-4  1.28e-3 7105    
#>  2 ENSG0000000…    520.          0.207  0.112   1.84   6.53e-2  1.97e-1 8813    
#>  3 ENSG0000000…    237.          0.0379 0.143   0.264  7.92e-1  9.11e-1 57147   
#>  4 ENSG0000000…     57.9        -0.0882 0.287  -0.307  7.59e-1  8.95e-1 55732   
#>  5 ENSG0000000…   5817.          0.426  0.0883  4.83   1.38e-6  1.82e-5 3075    
#>  6 ENSG0000000…   1282.         -0.241  0.0887 -2.72   6.58e-3  3.28e-2 2519    
#>  7 ENSG0000000…    610.         -0.0476 0.167  -0.286  7.75e-1  9.03e-1 2729    
#>  8 ENSG0000000…    369.         -0.500  0.121  -4.14   3.48e-5  3.42e-4 4800    
#>  9 ENSG0000000…    183.         -0.124  0.180  -0.689  4.91e-1  7.24e-1 90529   
#> 10 ENSG0000000…   2814.         -0.0411 0.103  -0.400  6.89e-1  8.57e-1 57185   
#> # … with 15,105 more rows

Now that the data is ready for GSEA we can go ahead and use goana() and make the results into a tibble. This tibble will be considered our ‘set’ tibble.

library(limma)
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:DESeq2':
#> 
#>     plotMA
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
    as.data.frame() %>%
    as_tibble(rownames = "GOALL")
go_ids
#> # A tibble: 21,282 × 6
#>    GOALL      Term                                    Ont       N    DE     P.DE
#>    <chr>      <chr>                                   <chr> <dbl> <dbl>    <dbl>
#>  1 GO:0008150 biological_process                      BP    12328  3396 2.33e-58
#>  2 GO:0001867 complement activation, lectin pathway   BP        5     3 1.03e- 1
#>  3 GO:0001868 regulation of complement activation, l… BP        2     1 4.37e- 1
#>  4 GO:0001869 negative regulation of complement acti… BP        2     1 4.37e- 1
#>  5 GO:0002252 immune effector process                 BP      351   114 8.59e- 4
#>  6 GO:0002253 activation of immune response           BP      177    62 1.75e- 3
#>  7 GO:0002376 immune system process                   BP     1627   555 1.64e-18
#>  8 GO:0002682 regulation of immune system process     BP      888   320 2.94e-14
#>  9 GO:0002683 negative regulation of immune system p… BP      280   106 1.05e- 6
#> 10 GO:0002684 positive regulation of immune system p… BP      518   187 7.07e- 9
#> # … with 21,272 more rows

The last thing we need to do is create a tibble that we will consider our ‘elementset’ tibble. This tibble will be a mapping of all the elements and sets.

foo <- AnnotationDbi::select(
    org.Hs.eg.db,
    tbl$ENTREZID,
    "GOALL",
    "ENTREZID") %>% as_tibble()
#> 'select()' returned many:many mapping between keys and columns
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo
#> # A tibble: 1,055,680 × 2
#>    ENTREZID GOALL     
#>    <chr>    <chr>     
#>  1 7105     GO:0002221
#>  2 7105     GO:0002376
#>  3 7105     GO:0002682
#>  4 7105     GO:0002683
#>  5 7105     GO:0002753
#>  6 7105     GO:0002764
#>  7 7105     GO:0002831
#>  8 7105     GO:0002832
#>  9 7105     GO:0006950
#> 10 7105     GO:0006952
#> # … with 1,055,670 more rows

The function BiocSet_from_elementset() allows for users to create a BiocSet object from tibbles. This function is helpful when there is metadata contained in the tibble that should be in the BiocSet object. For this function to work properly, the columns that are being joined on need to be named correctly. For instance, in order to use this function on the tibbles we created we need to change the column in the ‘element’ tibble to ‘element’, the column in the ‘set’ tibble to ‘set’ and the same will be for the ‘elementset’ tibble. We demonstrate this below and then create the BiocSet object with the simple function.

foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
#> more elements in 'element' than in 'elementset'
#> more elements in 'set' than in 'elementset'
es
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 12,336 × 8
#>   element ENSEMBL         baseMean log2FoldChange  lfcSE  stat   pvalue     padj
#>   <chr>   <chr>              <dbl>          <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#> 1 3980    ENSG00000005156     206.         -0.393 0.172  -2.29 2.23e- 2  8.66e-2
#> 2 1890    ENSG00000025708     145.          1.23  0.199   6.18 6.58e-10  1.49e-8
#> 3 4205    ENSG00000068305    1566.          0.529 0.0919  5.76 8.45e- 9  1.64e-7
#> # … with 12,333 more rows
#> 
#> es_set():
#> # A tibble: 14,912 × 6
#>   set        Term                             Ont       N    DE       P.DE
#>   <chr>      <chr>                            <chr> <dbl> <dbl>      <dbl>
#> 1 GO:0000002 mitochondrial genome maintenance BP       20     6 0.383     
#> 2 GO:0000003 reproduction                     BP      986   307 0.00000386
#> 3 GO:0000012 single strand break repair       BP       11     1 0.958     
#> # … with 14,909 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 1,055,680 × 2
#>   element set       
#>   <chr>   <chr>     
#> 1 3980    GO:0000002
#> 2 1890    GO:0000002
#> 3 4205    GO:0000002
#> # … with 1,055,677 more rows

For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame.

tibble_from_element(es)
#> Joining, by = "set"
#> Joining, by = "element"
#> # A tibble: 12,328 × 14
#>    element   set   Term  Ont   N     DE    P.DE  ENSEMBL baseMean log2FoldChange
#>    <chr>     <lis> <lis> <lis> <lis> <lis> <lis> <list>  <list>   <list>        
#>  1 1         <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [1]>     
#>  2 100       <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [4… <dbl [489]>   
#>  3 1000      <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [2… <dbl [222]>   
#>  4 10000     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [156]>   
#>  5 10001     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [7… <dbl [78]>    
#>  6 10003     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [9… <dbl [9]>     
#>  7 10004     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [19]>    
#>  8 100048912 <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [3… <dbl [36]>    
#>  9 10005     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [122]>   
#> 10 10006     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1… <dbl [139]>   
#> # … with 12,318 more rows, and 4 more variables: lfcSE <list>, stat <list>,
#> #   pvalue <list>, padj <list>

head(data.frame_from_elementset(es))
#> Joining, by = "set"
#> Joining, by = "element"
#>   element        set                             Term Ont  N DE      P.DE
#> 1    3980 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#> 2    1890 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#> 3    4205 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#> 4   10891 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#> 5   55186 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#> 6    4358 GO:0000002 mitochondrial genome maintenance  BP 20  6 0.3825257
#>           ENSEMBL  baseMean log2FoldChange      lfcSE      stat       pvalue
#> 1 ENSG00000005156  205.6627    -0.39272067 0.17180717 -2.285822 2.226466e-02
#> 2 ENSG00000025708  144.6965     1.23104228 0.19933257  6.175821 6.582046e-10
#> 3 ENSG00000068305 1565.5675     0.52905471 0.09186326  5.759155 8.453618e-09
#> 4 ENSG00000109819  152.4021    -0.23932270 0.17393025 -1.375969 1.688311e-01
#> 5 ENSG00000114120 2065.4720     0.14500117 0.09361789  1.548862 1.214150e-01
#> 6 ENSG00000115204 1091.6398    -0.09563418 0.09191349 -1.040480 2.981168e-01
#>           padj
#> 1 8.661787e-02
#> 2 1.494473e-08
#> 3 1.636969e-07
#> 4 3.845953e-01
#> 5 3.059216e-01
#> 6 5.485252e-01

6 Information look up

A final feature to BiocSet is the ability to add reference information about all of the elements/sets. A user could utilize the function url_ref() to add information to the BiocSet object. If a user has a question about a specific id then they can follow the reference url to get more informtion. Below is an example of using url_ref() to add reference urls to the go data set we worked with above.

url_ref(go)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 23,067 × 2
#>   element         url                                                           
#>   <chr>           <chr>                                                         
#> 1 ENSG00000151729 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 2 ENSG00000025708 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> 3 ENSG00000068305 https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=E…
#> # … with 23,064 more rows
#> 
#> es_set():
#> # A tibble: 18,593 × 2
#>   set        url                                                           
#>   <chr>      <chr>                                                         
#> 1 GO:0000002 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000002
#> 2 GO:0000003 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000003
#> 3 GO:0000009 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000009
#> # … with 18,590 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 322,605 × 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 322,602 more rows

7 Session info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] limma_3.49.4                tibble_3.1.5               
#>  [3] DESeq2_1.33.5               BiocFileCache_2.1.1        
#>  [5] dbplyr_2.1.1                GO.db_3.14.0               
#>  [7] org.Hs.eg.db_3.14.0         AnnotationDbi_1.55.2       
#>  [9] airway_1.13.0               SummarizedExperiment_1.23.5
#> [11] Biobase_2.53.0              GenomicRanges_1.45.0       
#> [13] GenomeInfoDb_1.29.10        IRanges_2.27.2             
#> [15] S4Vectors_0.31.5            BiocGenerics_0.39.2        
#> [17] MatrixGenerics_1.5.4        matrixStats_0.61.0         
#> [19] BiocSet_1.7.2               dplyr_1.0.7                
#> [21] BiocStyle_2.21.4           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7           bit64_4.0.5            filelock_1.0.2        
#>  [4] RColorBrewer_1.1-2     httr_1.4.2             tools_4.1.1           
#>  [7] bslib_0.3.1            utf8_1.2.2             R6_2.5.1              
#> [10] colorspace_2.0-2       DBI_1.1.1              withr_2.4.2           
#> [13] tidyselect_1.1.1       ontologyIndex_2.7      bit_4.0.4             
#> [16] curl_4.3.2             compiler_4.1.1         cli_3.0.1             
#> [19] formatR_1.11           DelayedArray_0.19.4    bookdown_0.24         
#> [22] sass_0.4.0             scales_1.1.1           genefilter_1.75.1     
#> [25] rappdirs_0.3.3         stringr_1.4.0          digest_0.6.28         
#> [28] rmarkdown_2.11         XVector_0.33.0         pkgconfig_2.0.3       
#> [31] htmltools_0.5.2        fastmap_1.1.0          rlang_0.4.12          
#> [34] RSQLite_2.2.8          jquerylib_0.1.4        BiocIO_1.3.0          
#> [37] generics_0.1.0         jsonlite_1.7.2         BiocParallel_1.27.17  
#> [40] RCurl_1.98-1.5         magrittr_2.0.1         GenomeInfoDbData_1.2.7
#> [43] Matrix_1.3-4           munsell_0.5.0          Rcpp_1.0.7            
#> [46] fansi_0.5.0            lifecycle_1.0.1        stringi_1.7.5         
#> [49] yaml_2.2.1             zlibbioc_1.39.0        plyr_1.8.6            
#> [52] grid_4.1.1             blob_1.2.2             parallel_4.1.1        
#> [55] crayon_1.4.1           lattice_0.20-45        Biostrings_2.61.2     
#> [58] splines_4.1.1          annotate_1.71.0        KEGGREST_1.33.0       
#> [61] locfit_1.5-9.4         knitr_1.36             pillar_1.6.4          
#> [64] geneplotter_1.71.0     XML_3.99-0.8           glue_1.4.2            
#> [67] evaluate_0.14          BiocManager_1.30.16    png_0.1-7             
#> [70] vctrs_0.3.8            gtable_0.3.0           purrr_0.3.4           
#> [73] tidyr_1.1.4            assertthat_0.2.1       cachem_1.0.6          
#> [76] ggplot2_3.3.5          xfun_0.27              xtable_1.8-4          
#> [79] survival_3.2-13        memoise_2.0.0          ellipsis_0.3.2