R version: R version 4.1.1 (2021-08-10)
Bioconductor version: 3.14
Package version: 2.14.0

1 Introduction

In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed.

The data set used (1) is from a paper studying the differences in gene expression in inflamed and non-inflamed tissue. 14 patients suffering from Ulcerative colitis (UC) and 15 patients with Crohn’s disease (CD) were tested, and from each patient inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy. This is a typical clinical data set consisting of 58 arrays in total. Our aim is to analyze differential expression (DE) between the tissues. Our results show a substantial overlap with the results of the original paper.

2 Workflow package installation

The workflow is wrapped in a package called maEndToEnd.

The maEndToEnd package can currently be obtained from GitHub and is available via the current development version of Bioconductor (3.8) (see here: http://bioconductor.org/packages/devel/workflows/html/maEndToEnd.html).

2.1 Workflow package installation from Bioconductor

You can install the package via the BiocManager.

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("maEndToEnd", version = "devel")

Currently, the workflow is available in the development version of Bioconductor (3.8), which will become the release version in October 2018.

For details on how to use this version of Bioconductor see:

http://bioconductor.org/developers/how-to/useDevel/

2.2 Workflow package installation from Github

#In order to download the package from GitHub, we need the "install_github" 
#function from the "remotes" package. We download the latest developer 
#version of "remotes" from GitHub with the devtool::install_github
#function; note that this is necessary as the current "remotes" version on
#CRAN doesn't allow us to correctly download the "maEndToEnd" package: 

install.packages("devtools")
library(devtools)

devtools::install_github("r-lib/remotes")
library(remotes)
packageVersion("remotes") # has to be 1.1.1.9000 or later

remotes::install_github("b-klaus/maEndToEnd", ref="master")

2.3 Workflow package import

Once the workflow package has been successfully installed, we can use a call to library() in order to load it. This will also load all the other packages neccessary to run the workflow.

suppressPackageStartupMessages({library("maEndToEnd")})
   
   groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

2.4 List of packages required for the workflow

Below, you find a list of packages that are required by the workflow. Some Helper/Styling packages have been commented here, as they are not strictly neccesary to execute the workflow.

#General Bioconductor packages
    library(Biobase)
    library(oligoClasses)
     
#Annotation and data import packages
    library(ArrayExpress)
    library(pd.hugene.1.0.st.v1)
    library(hugene10sttranscriptcluster.db)
     
#Quality control and pre-processing packages
    library(oligo)
    library(arrayQualityMetrics)
     
#Analysis and statistics packages
    library(limma)
    library(topGO)
    library(ReactomePA)
    library(clusterProfiler)
     
#Plotting and color options packages
    library(gplots)
    library(ggplot2)
    library(geneplotter)
    library(RColorBrewer)
    library(pheatmap)
    library(enrichplot)
     
#Formatting/documentation packages
   #library(rmarkdown)
   #library(BiocStyle)
    library(dplyr)
    library(tidyr)

#Helpers:
    library(stringr)
    library(matrixStats)
    library(genefilter)
    library(openxlsx)
   #library(devtools)

3 Downloading the raw data from ArrayExpress

The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the measured probe intensities. The data we use have been deposited at ArrayExpress and have the accession code E-MTAB-2967.

We will store these files in the directory raw_data_dir which defaults to a temporary directory.

raw_data_dir <- tempdir()

if (!dir.exists(raw_data_dir)) {
    dir.create(raw_data_dir)
}

Each ArrayExpress data set has a landing page summarizing the data set, and we use the getAEfunction from the ArrayExpress Bioconductor package to obtain the ftp links to the raw data files (Data from Palmieri et. al. on ArrayEpress).

With the code below, we download the raw data (also including annotation data) from ArrayExpress (2) by using the getAE-function. The data are saved in the raw_data_dir created above. The names of the downloaded files are returned as a list.

anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")

We will now have a closer look at the data we downloaded from ArrayExpress

4 Background information on the data

4.1 Information stored in ArrayExpress

Each dataset at ArrayExpress is stored according to the MAGE-TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE-TAB format specifies up to five different types of files:

  • Investigation Description Format (IDF)
  • Array Design Format (ADF)
  • Sample and Data Relationship Format (SDRF)
  • raw data files
  • processed data files

Other than the raw data files, the IDF and the SDRF file are important for us. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.

Before we move on to the actual raw data import, we will briefly introduce the ExpressionSet class contained in the Biobase package. It is commonly used to store microarray data in Bioconductor.

4.2 Bioconductor ExpressionSets

Genomic data can be very complex, usually consisting of a number of different components, e.g. information on the experimental samples, annotation of genomic features measured as well as the experimental data itself. In Bioconductor, the approach is taken that these components should be stored in a single structure to easily manage the data.

The package Biobase contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information (i.e. as contained in the various MAGE-TAB files) into a single convenient structure. An ExpressionSet can be manipulated (e.g., subsetted, copied), and is the input to or output of many Bioconductor functions.

The data in an ExpressionSet consist of:

  • assayData: Expression data from microarray experiments with microarray probes in rows and sample identifiers in columns
  • metaData
    • phenoData: A description of the samples in the experiment with sample identifiers in rows and description elements in columns; holds the content of the SDRF file
    • featureData: metadata about the features on the chip
      or technology used for the experiment with same rows as assayData by default and freely assignable columns
    • further annotations for the features, for example gene annotations from biomedical databases (annotation).
  • experimentData: A flexible structure to describe the experiment.

The ExpressionSet class coordinates all of these data, so that one does not have to worry about the details. However, one should keep in mind that the rownames of the phenoData have to match the column names of the assay data, while the row names of the assay data have to match the row names of the featureData. This is illustrated in Figure 1.

Structure of Bioconductor’s ExpressionSet class.

Figure 1: Structure of Bioconductor’s ExpressionSet class

You can use the functions pData and fData to extract the sample and feature annotation, respectively, from an ExpressionSet. The function exprs will return the expression data itself as a matrix.

5 Import of annotation data and microarray expression data as “ExpressionSet”

We import the SDRF file with the read.delim function from the raw data folder in order to obtain the sample annotation.

The sample names are given in the column Array.Data.File of the SDRF data table and will be used as rownames for the SDRF file.

We turn the SDRF table into an AnnotatedDataFrame from the Biobase package that we will need later to create an ExpressionSet for our data (3).

sdrf_location <- file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt")
SDRF <- read.delim(sdrf_location)

rownames(SDRF) <- SDRF$Array.Data.File
SDRF <- AnnotatedDataFrame(SDRF)

We now create the Expression Set object raw_data, which contains array data, pheno data (from the SDRF file) as well as the information of the chip annotation package used.

The analysis of Affymetrix arrays starts with CEL files. These are the result of the processing of the raw image files using the Affymetrix software and contain estimated probe intensity values. Each CEL file additionally contains some metadata, such as a chip identifier.

We use the function read.celfiles from the oligo package (4) to import the files:

raw_data <- oligo::read.celfiles(filenames = file.path(raw_data_dir, 
                                                SDRF$Array.Data.File),
                                    verbose = FALSE, phenoData = SDRF)
stopifnot(validObject(raw_data))

This automatically creates an ExpressionSet, fills the sections “array data” with the data from the CEL files and uses the correct chip annotation package, in this case pd.hugene.1.0.st.v1 (the chip-type is also stored in the .CEL files).

Furthermore, we specified our AnnotatedDataFrameSDRF” created earlier from the SDRF file as phenoData. Thus, we had to make sure to import the CEL files in the order that corresponds to the SDRF table — to enforce this, we used the column Array.Data.File of the SDRF table as the filenames argument.

Finally, we checked whether the object created is valid (e.g. sample names match between the different tables).

We now have a first look on the r