The package EnrichmentBrowser implements an analysis pipeline for high-throughput gene expression data as measured with microarrays and RNA-seq. In a workflow-like manner, the package brings together a selection of established Bioconductor packages for gene expression data analysis. It integrates a wide range of gene set and network enrichment analysis methods and facilitates combination and exploration of results across methods.
Report issues on https://github.com/lgeistlinger/EnrichmentBrowser/issues
The package EnrichmentBrowser implements essential functionality for the enrichment analysis of gene expression data. The analysis combines the advantages of set-based and network-based enrichment analysis to derive high-confidence gene sets and biological pathways that are differentially regulated in the expression data under investigation. Besides, the package facilitates the visualization and exploration of such sets and pathways. The following instructions will guide you through an end-to-end expression data analysis workflow including:
Preparing the data
Preprocessing of the data
Differential expression (DE) analysis
Defining gene sets of interest
Executing individual enrichment methods
Combining the results of different methods
Visualize and explore the results
All of these steps are modular, i.e. each step can be executed individually and fine-tuned with several parameters. In case you are interested in a particular step, you can directly move on to the respective section. For example, if you have differential expression already calculated for each gene, and you are now interested in whether certain gene functions are enriched for differential expression, section 7.2 would be the one you should go for. The last section 9 also demonstrates how to wrap the whole workflow into a single function, making use of suitably chosen defaults.
Typically, the expression data is not already available in R but
rather has to be read in from a file. This can be done using the function
readSE, which reads the expression data (
exprs) along with the
phenotype data (
colData) and feature data (
rowData) into a
library(EnrichmentBrowser) data.dir <- system.file("extdata", package = "EnrichmentBrowser") exprs.file <- file.path(data.dir, "exprs.tab") cdat.file <- file.path(data.dir, "colData.tab") rdat.file <- file.path(data.dir, "rowData.tab") se <- readSE(exprs.file, cdat.file, rdat.file)
The man pages provide details on file format and the SummarizedExperiment data structure.
Note: Previous versions of the EnrichmentBrowser used the ExpressionSet data structure. The migration to SummarizedExperiment in the current release of the EnrichmentBrowser is done to reflect recent developments in Bioconductor, which discourages the use of ExpressionSet in favor of SummarizedExperiment. The major reasons are the compatibility of SummarizedExperiment with operations on genomic regions as well as efficient handling of big data.
To enable a smooth transition, all functions of the EnrichmentBrowser are still accepting also an ExpressionSet as input, but are consistently returning a SummarizedExperiment as output.
Furthermore, users can always coerce the SummarizedExperiment to ExpressionSet via
eset <- as(se, "ExpressionSet")
and vice versa
se <- as(eset, "SummarizedExperiment")
The two major data types processed by the EnrichmentBrowser are microarray (intensity measurements) and RNA-seq (read counts) data.
Although RNA-seq has become the de facto standard for transcriptomic profiling, it is important to know that many methods for differential expression and gene set enrichment analysis have been originally developed for microarray data.
However, differences in data distribution assumptions (microarray: quasi-normal, RNA-seq: negative binomial) made adaptations in differential expression analysis and, to some extent, also in gene set enrichment analysis if necessary.
Thus, we consider two example data sets – a microarray and an RNA-seq data set, and discuss similarities and differences between the respective analysis steps.
To demonstrate the package’s functionality for microarray data, we consider expression measurements of patients with acute lymphoblastic leukemia . A frequent chromosomal defect found among these patients is a translocation, in which parts of chromosome 9 and 22 swap places. This results in the oncogenic fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.
We load the ALL dataset
and select B-cell ALL patients with and without the BCR/ABL fusion as described previously .
ind.bs <- grep("^B", ALL$BT) ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG")) sset <- intersect(ind.bs, ind.mut) all.eset <- ALL[, sset]
We can now access the expression values, which are intensity measurements on a log-scale for 12,625 probes (rows) across 79 patients (columns).
dim(all.eset) #> Features Samples #> 12625 79 exprs(all.eset)[1:4,1:4] #> 01005 01010 03002 04007 #> 1000_at 7.597323 7.479445 7.567593 7.905312 #> 1001_at 5.046194 4.932537 4.799294 4.844565 #> 1002_f_at 3.900466 4.208155 3.886169 3.416923 #> 1003_s_at 5.903856 6.169024 5.860459 5.687997
As we often have more than one probe per gene, we summarize the gene expression values as the average of the corresponding probe values.
allSE <- probe2gene(all.eset) head(rownames(allSE)) #>  "5595" "7075" "1557" "643" "1843" "4319"
Note, that the mapping from the probe to gene is done automatically as long
as you have the corresponding annotation package, here the
hgu95av2.db package, installed. Otherwise, the mapping
can be manually defined in the
rowData(se) #> DataFrame with 1000 rows and 1 column #> ENTREZID #> <character> #> 3075 3075 #> 572 572 #> 4267 4267 #> 26 26 #> 51384 51384 #> ... ... #> 5295 5295 #> 2966 2966 #> 9140 9140 #> 5558 5558 #> 1956 1956
To demonstrate the functionality of the package for RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone .
We load the airway dataset
For further analysis, we remove genes with very low read counts and measurements that are not mapped to an ENSEMBL gene ID.
airSE <- airway[grep("^ENSG", rownames(airway)),] airSE <- airSE[rowSums(assay(airSE)) > 4,] dim(airSE) #>  25133 8 assay(airSE)[1:4,1:4] #> SRR1039508 SRR1039509 SRR1039512 SRR1039513 #> ENSG00000000003 679 448 873 408 #> ENSG00000000419 467 515 621 365 #> ENSG00000000457 260 211 263 164 #> ENSG00000000460 60 55 40 35
Normalization of high-throughput expression data is essential to make
results within and between experiments comparable. Microarray (intensity
measurements) and RNA-seq (read counts) data typically show distinct
features that need to be normalized for. The function
commonly used functionality from limma for microarray
normalization and from EDASeq for RNA-seq normalization.
For specific needs that deviate from these standard normalizations, the
user should always refer to more specific functions/packages.
Microarray data is expected to be single-channel. For two-color arrays,
it is expected that normalization within arrays has been already carried
out, e.g. using from
normalizeWithinArrays from limma.
A default quantile normalization based on
limma can be carried out via
allSE <- normalize(allSE, norm.method = "quantile")
par(mfrow=c(1,2)) boxplot(assay(allSE, "raw")) boxplot(assay(allSE, "norm"))