Contents

library(gemma.R)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(SummarizedExperiment)
library(pheatmap)
library(viridis)

1 About Gemma

Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles. Gemma contains data from thousands of public studies, referencing thousands of published papers. Every dataset in Gemma has passed a rigorous curation process that re-annotates the expression platform at the sequence level, which allows for more consistent cross-platform comparisons and meta-analyses.

For detailed information on the curation process, read this page or the latest publication.

2 Package cheat sheet

3 Installation instructions

3.1 Bioconductor

You can install gemma.R through Bioconductor with the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("gemma.R")

4 Searching for datasets of interest in Gemma

The package includes various functions to search for datasets fitting desired criteria.

All datasets belonging to a taxon of interest can be accessed by using get_taxon_datasets while the function search_datasets can be used to further limit the results by a specified query containing key words, experiment names or ontology term URIs

get_taxon_datasets(taxon = 'human') %>%  # all human datasets 
    select(experiment.ShortName, experiment.Name,taxon.Name) %>% head
   experiment.ShortName
1:              GSE2018
2:              GSE4036
3:              GSE3489
4:              GSE1923
5:               GSE361
6:               GSE492
                                                                          experiment.Name
1:                                                            Human Lung Transplant - BAL
2:                                                                perro-affy-human-186940
3: Patterns of gene dysregulation in the frontal cortex of patients with HIV encephalitis
4: Identification of PDGF-dependent patterns of gene expression in U87 glioblastoma cells
5:                                                   Mammary epithelial cell transduction
6:                               Effect of prostaglandin analogs on aqueous humor outflow
   taxon.Name
1:      human
2:      human
3:      human
4:      human
5:      human
6:      human
search_datasets('bipolar',taxon = 'human') %>% # human datasets mentioning bipolar
    select(experiment.ShortName, experiment.Name,taxon.Name) %>% head
   experiment.ShortName
1:              GSE5389
2:              GSE4030
3:              GSE5388
4:              GSE7036
5:   McLean Hippocampus
6:           McLean_PFC
                                                                                                           experiment.Name
1:           Adult postmortem brain tissue (orbitofrontal cortex) from subjects with bipolar disorder and healthy controls
2:                                                                                                 bunge-affy-arabi-162779
3: Adult postmortem brain tissue (dorsolateral prefrontal cortex) from subjects with bipolar disorder and healthy controls
4:                                               Expression profiling in monozygotic twins discordant for bipolar disorder
5:                                                                                                      McLean Hippocampus
6:                                                                                                              McLean_PFC
   taxon.Name
1:      human
2:      human
3:      human
4:      human
5:      human
6:      human
search_datasets('http://purl.obolibrary.org/obo/DOID_3312', # ontology term URI for the bipolar disorder
                taxon = 'human') %>% 
    select(experiment.ShortName, experiment.Name,taxon.Name) %>% head
   experiment.ShortName
1:              GSE5389
2:              GSE5388
3:              GSE7036
4:   McLean Hippocampus
5:           McLean_PFC
6:     stanley_feinberg
                                                                                                           experiment.Name
1:           Adult postmortem brain tissue (orbitofrontal cortex) from subjects with bipolar disorder and healthy controls
2: Adult postmortem brain tissue (dorsolateral prefrontal cortex) from subjects with bipolar disorder and healthy controls
3:                                               Expression profiling in monozygotic twins discordant for bipolar disorder
4:                                                                                                      McLean Hippocampus
5:                                                                                                              McLean_PFC
6:                                                                     Stanley consortium collection Cerebellum - Feinberg
   taxon.Name
1:      human
2:      human
3:      human
4:      human
5:      human
6:      human

Note that a single call of these functions will only return 20 results by default and a 100 results maximum, controlled by the limit argument. In order to get all available results, offset argument should be used to make multiple calls.

To see how many available results are there, you can look at the attributes of the output objects where additional information from the API response is appended.

# a quick call with a limit of 1 to get the result count
result <- get_taxon_datasets(taxon = 'human',limit = 1) 
print(attributes(result)$totalElements)
[1] 5979

Since the maximum limit is 100 getting all results available will require multiple calls.

count = attributes(result)$totalElements
all_results <- lapply(seq(0, count, 100), function(offset){
    get_taxon_datasets(taxon = 'human',limit = 100, offset = offset)
}) %>% do.call(rbind,.) %>% 
    select(experiment.ShortName, experiment.Name,taxon.Name) %>% 
    head()
   experiment.ShortName
1:              GSE2018
2:              GSE4036
3:              GSE3489
4:              GSE1923
5:               GSE361
6:               GSE492
                                                                          experiment.Name
1:                                                            Human Lung Transplant - BAL
2:                                                                perro-affy-human-186940
3: Patterns of gene dysregulation in the frontal cortex of patients with HIV encephalitis
4: Identification of PDGF-dependent patterns of gene expression in U87 glioblastoma cells
5:                                                   Mammary epithelial cell transduction
6:                               Effect of prostaglandin analogs on aqueous humor outflow
   taxon.Name
1:      human
2:      human
3:      human
4:      human
5:      human
6:      human

See larger queries section for more details. To keep this vignette simpler we will keep using the first 20 results returned by default in examples below.

Information provided about the datasets by these functions include details about the quality and design of the study that can be used to judge if it is suitable for your use case. For instance geeq.batchEffect column will be set to -1 if Gemma’s preprocessing has detected batch effects that were unable to be resolved by batch correction and experiment.SampleCount will include the number of samples used in the experiment. More information about these and other fields can be found at the function documentation.

get_taxon_datasets(taxon = 'human') %>% # get human datasets
    filter(geeq.batchEffect !=-1 & experiment.SampleCount > 4) %>% # filter for datasets without batch effects and with a sample count more than 4
    select(experiment.ShortName, experiment.Name,taxon.Name) %>% head
   experiment.ShortName
1:              GSE2018
2:              GSE4036
3:              GSE3489
4:              GSE1923
5:               GSE361
6:               GSE492
                                                                          experiment.Name
1:                                                            Human Lung Transplant - BAL
2:                                                                perro-affy-human-186940
3: Patterns of gene dysregulation in the frontal cortex of patients with HIV encephalitis
4: Identification of PDGF-dependent patterns of gene expression in U87 glioblastoma cells
5:                                                   Mammary epithelial cell transduction
6:                               Effect of prostaglandin analogs on aqueous humor outflow
   taxon.Name
1:      human
2:      human
3:      human
4:      human
5:      human
6:      human

Gemma uses multiple ontologies when annotating datasets and using the term URIs instead of free text to search can lead to more specific results. search_annotations function allows searching for annotation terms that might be relevant to your query.

search_annotations('bipolar') %>% 
    head
   category.Name                         category.URI         value.Name
1:          <NA>                                 <NA>            Bipolar
2:       disease http://www.ebi.ac.uk/efo/EFO_0000408 bipolar I disorder
3:       disease http://www.ebi.ac.uk/efo/EFO_0000408   Bipolar Disorder
4:       disease http://www.ebi.ac.uk/efo/EFO_0000408 bipolar II disoder
5:       disease http://www.ebi.ac.uk/efo/EFO_0000408  Bipolar depressed
6:          <NA>                                 <NA>   Bipolar Disorder
                                   value.URI
1:                                      <NA>
2: http://purl.obolibrary.org/obo/DOID_14042
3:  http://purl.obolibrary.org/obo/DOID_3312
4:      http://www.ebi.ac.uk/efo/EFO_0009964
5:                                      <NA>
6:                                      <NA>

5 Downloading expression data

Upon identifying datasets of interest, more information about specific ones can be requested. In this example we will be using GSE46416 which includes samples taken from healthy donors along with manic/euthymic phase bipolar disorder patients.

The data associated with specific experiments can be accessed by using get_datasets_by_ids

get_datasets_by_ids("GSE46416") %>%
    select(experiment.ShortName, experiment.Name, experiment.ID, experiment.Description)
   experiment.ShortName
1:             GSE46416
                                                   experiment.Name
1: State- and trait-specific gene expression in euthymia and mania
   experiment.ID
1:          8997
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       experiment.Description
1: Gene expression profiles of bipolar disorder (BD) patients were assessed during both a manic and a euthymic phase and compared both intra-individually, and with the gene expression profiles of controls.\nLast Updated (by provider): Sep 05 2014\nContributors:  Christian C Witt Benedikt Brors Dilafruz Juraeva Jens Treutlein Carsten Sticht Stephanie H Witt Jana Strohmaier Helene Dukal Josef Frank Franziska Degenhardt Markus M Nöthen Sven Cichon Maren Lang Marcella Rietschel Sandra Meier Manuel Mattheisen

To access the expression data in a convenient form, you can use get_dataset_object. It is a high-level wrapper that combines various endpoint calls to return lists of annotated SummarizedExperiment or ExpressionSet objects that are compatible with other Bioconductor packages or a tidyverse-friendly long form tibble for downstream analyses. These include the expression matrix along with the experimental design, and ensure the sample names match between both when transforming/subsetting data.

dat <- get_dataset_object("GSE46416",
                          type = 'se') # SummarizedExperiment is the default output type

Note that the tidy format is less memory efficient but allows easy visualization and exploration with ggplot2 and the rest of the tidyverse.

To show how subsetting works, we’ll keep the “manic phase” data and the reference_subject_roles, which refers to the control samples in Gemma datasets.

# Check the levels of the disease factor
dat[[1]]$disease %>% unique()
[1] "euthymic phase,Bipolar Disorder" "reference subject role"         
[3] "bipolar disorder,manic phase"   
# Subset patients during manic phase and controls
manic <- dat[[1]][, dat[[1]]$disease == "bipolar disorder,manic phase" | 
        dat[[1]]$disease == "reference subject role"]
manic
class: SummarizedExperiment 
dim: 21410 21 
metadata(8): title abstract ... GemmaSuitabilityScore taxon
assays(1): counts
rownames(21410): 2315430 2315554 ... 7385683 7385696
rowData names(4): Probe GeneSymbol GeneName NCBIid
colnames(21): Control, 15 Control, 8 ... Control, 2_DE40 Bipolar
  disorder patient manic phase, 37
colData names(3): factorValues block disease

Let’s take a look at sample to sample correlation in our subset.

# Get Expression matrix
manicExpr <- assay(manic, "counts")


manicExpr %>% 
    cor %>% 
    pheatmap(col =viridis(10),border_color = NA,angle_col = 45,fontsize = 7)