R version: R version 3.6.3 (2020-02-29)
Bioconductor version: 3.10
Package version: 2.6.1
In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed.
The data set used (1) is from a paper studying the differences in gene expression in inflamed and non-inflamed tissue. 14 patients suffering from Ulcerative colitis (UC) and 15 patients with Crohn’s disease (CD) were tested, and from each patient inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy. This is a typical clinical data set consisting of 58 arrays in total. Our aim is to analyze differential expression (DE) between the tissues. Our results show a substantial overlap with the results of the original paper.
The workflow is wrapped in a package called maEndToEnd
.
The maEndToEnd
package can currently be obtained from GitHub and is
available via the current development version of Bioconductor (3.8) (see here: http://bioconductor.org/packages/devel/workflows/html/maEndToEnd.html).
You can install the package via the BiocManager.
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("maEndToEnd", version = "devel")
Currently, the workflow is available in the development version of Bioconductor (3.8), which will become the release version in October 2018.
For details on how to use this version of Bioconductor see:
#In order to download the package from GitHub, we need the "install_github"
#function from the "remotes" package. We download the latest developer
#version of "remotes" from GitHub with the devtool::install_github
#function; note that this is necessary as the current "remotes" version on
#CRAN doesn't allow us to correctly download the "maEndToEnd" package:
install.packages("devtools")
library(devtools)
devtools::install_github("r-lib/remotes")
library(remotes)
packageVersion("remotes") # has to be 1.1.1.9000 or later
remotes::install_github("b-klaus/maEndToEnd", ref="master")
Once the workflow package has been successfully installed, we can use
a call to library()
in order to load it. This will also load
all the other packages neccessary to run the workflow.
suppressPackageStartupMessages({library("maEndToEnd")})
No methods found in package 'RSQLite' for request: 'dbListFields' when loading 'oligo'
groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built.
Below, you find a list of packages that are required by the workflow. Some Helper/Styling packages have been commented here, as they are not strictly neccesary to execute the workflow.
#General Bioconductor packages
library(Biobase)
library(oligoClasses)
#Annotation and data import packages
library(ArrayExpress)
library(pd.hugene.1.0.st.v1)
library(hugene10sttranscriptcluster.db)
#Quality control and pre-processing packages
library(oligo)
library(arrayQualityMetrics)
#Analysis and statistics packages
library(limma)
library(topGO)
library(ReactomePA)
library(clusterProfiler)
#Plotting and color options packages
library(gplots)
library(ggplot2)
library(geneplotter)
library(RColorBrewer)
library(pheatmap)
#Formatting/documentation packages
#library(rmarkdown)
#library(BiocStyle)
library(dplyr)
library(tidyr)
#Helpers:
library(stringr)
library(matrixStats)
library(genefilter)
library(openxlsx)
#library(devtools)
The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the measured probe intensities. The data we use have been deposited at ArrayExpress and have the accession code E-MTAB-2967.
We will store these files in the directory raw_data_dir which defaults to a temporary directory.
raw_data_dir <- tempdir()
if (!dir.exists(raw_data_dir)) {
dir.create(raw_data_dir)
}
Each ArrayExpress data set has a landing page summarizing the data set,
and we use the getAE
function from the ArrayExpress Bioconductor package to obtain
the ftp links to the raw data files (Data from Palmieri et. al. on ArrayEpress).
With the code below, we download the raw data (also including annotation data) from
ArrayExpress (2)
by using the getAE
-function. The data are saved in the raw_data_dir
created above. The names of the downloaded files are returned as a list.
anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")
We will now have a closer look at the data we downloaded from ArrayExpress
Each dataset at ArrayExpress is stored according to the MAGE-TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE-TAB format specifies up to five different types of files:
Other than the raw data files, the IDF and the SDRF file are important for us. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.
Before we move on to the actual raw data import, we will briefly introduce the
ExpressionSet
class contained in the Biobase package.
It is commonly used to store microarray data in Bioconductor.
Genomic data can be very complex, usually consisting of a number of different components, e.g. information on the experimental samples, annotation of genomic features measured as well as the experimental data itself. In Bioconductor, the approach is taken that these components should be stored in a single structure to easily manage the data.
The package Biobase contains standardized data structures
to represent genomic data. The ExpressionSet
class is designed
to combine several different sources of information (i.e. as contained in the
various MAGE-TAB files) into a single convenient
structure. An ExpressionSet can be manipulated (e.g., subsetted, copied),
and is the input to or output of many Bioconductor functions.
The data in an ExpressionSet consist of:
The ExpressionSet class coordinates all of these data, so that one does not
have to worry about the details. However, one should keep in mind that the rownames of the phenoData
have to match the column names of the assay data, while the row names of the assay data have to match the
row names of the featureData
.
This is illustrated in Figure 1.