R version: R Under development (unstable) (2023-11-11 r85510)
Bioconductor version: 3.19
Package version: 2.23.1

1 Introduction

In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed.

The data set used (1) is from a paper studying the differences in gene expression in inflamed and non-inflamed tissue. 14 patients suffering from Ulcerative colitis (UC) and 15 patients with Crohn’s disease (CD) were tested, and from each patient inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy. This is a typical clinical data set consisting of 58 arrays in total. Our aim is to analyze differential expression (DE) between the tissues. Our results show a substantial overlap with the results of the original paper.

2 Workflow package installation from Bioconductor

The workflow is wrapped in a package called maEndToEnd.
You can install the package via the BiocManager.

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("maEndToEnd")

Currently, the workflow has been available since Bioconductor version 3.8, which was first released in October 2018.

2.1 Workflow package import

Once the workflow package has been successfully installed, we can use a call to library() in order to load it. This will also load all the other packages neccessary to run the workflow.

suppressPackageStartupMessages({library("maEndToEnd")})
   
   groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

2.2 List of packages required for the workflow

Below, you find a list of packages that are required by the workflow. Some Helper/Styling packages have been commented here, as they are not strictly neccesary to execute the workflow.

#General Bioconductor packages
    library(Biobase)
    library(oligoClasses)
     
#Annotation and data import packages
    library(ArrayExpress)
    library(pd.hugene.1.0.st.v1)
    library(hugene10sttranscriptcluster.db)
     
#Quality control and pre-processing packages
    library(oligo)
    library(arrayQualityMetrics)
     
#Analysis and statistics packages
    library(limma)
    library(topGO)
    library(ReactomePA)
    library(clusterProfiler)
     
#Plotting and color options packages
    library(gplots)
    library(ggplot2)
    library(geneplotter)
    library(RColorBrewer)
    library(pheatmap)
    library(enrichplot)
     
#Formatting/documentation packages
   #library(rmarkdown)
   #library(BiocStyle)
    library(dplyr)
    library(tidyr)

#Helpers:
    library(stringr)
    library(matrixStats)
    library(genefilter)
    library(openxlsx)
   #library(devtools)

3 Downloading the raw data from ArrayExpress

The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the measured probe intensities. The data we use have been deposited at ArrayExpress and have the accession code E-MTAB-2967.

We will store these files in the directory raw_data_dir which defaults to a cache location determined by R. Alternatively, you can provide any location you like here.

raw_data_dir <- tools::R_user_dir(package = "maEndToEnd", which = "cache")

if (!dir.exists(raw_data_dir)) {
    dir.create(raw_data_dir)
}

Each ArrayExpress data set has a landing page summarizing the data set, and we use the getAEfunction from the ArrayExpress Bioconductor package to obtain the ftp links to the raw data files (Data from Palmieri et. al. on ArrayEpress).

With the code below, we download the raw data (also including annotation data) from ArrayExpress (2) by using the getAE-function. The data are saved in the raw_data_dir created above. The names of the downloaded files are returned as a list.

anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")
download.file(url = "https://www.ebi.ac.uk/biostudies/files/E-MTAB-2967/E-MTAB-2967.sdrf.txt",
              destfile = file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt"), 
              mode = "wb")

We will now have a closer look at the data we downloaded from ArrayExpress

4 Background information on the data

4.1 Information stored in ArrayExpress

Each dataset at ArrayExpress is stored according to the MAGE-TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE-TAB format specifies up to five different types of files:

  • Investigation Description Format (IDF)
  • Array Design Format (ADF)
  • Sample and Data Relationship Format (SDRF)
  • raw data files
  • processed data files

Other than the raw data files, the IDF and the SDRF file are important for us. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.

Before we move on to the actual raw data import, we will briefly introduce the ExpressionSet class contained in the Biobase package. It is commonly used to store microarray data in Bioconductor.

4.2 Bioconductor ExpressionSets

Genomic data can be very complex, usually consisting of a number of different components, e.g. information on the experimental samples, annotation of genomic features measured as well as the experimental data itself. In Bioconductor, the approach is taken that these components should be stored in a single structure to easily manage the data.

The package Biobase contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information (i.e. as contained in the various MAGE-TAB files) into a single convenient structure. An ExpressionSet can be manipulated (e.g., subsetted, copied), and is the input to or output of many Bioconductor functions.

The data in an ExpressionSet consist of:

  • assayData: Expression data from microarray experiments with microarray probes in rows and sample identifiers in columns
  • metaData
    • phenoData: A description of the samples in the experiment with sample identifiers in rows and description elements in columns; holds the content of the SDRF file
    • featureData: metadata about the features on the chip
      or technology used for the experiment with same rows as assayData by default and freely assignable columns
    • further annotations for the features, for example gene annotations from biomedical databases (annotation).
  • experimentData: A flexible structure to describe the experiment.

The ExpressionSet class coordinates all of these data, so that one does not have to worry about the details. However, one should keep in mind that the rownames of the phenoData have to match the column names of the assay data, while the row names of the assay data have to match the row names of the featureData. This is illustrated in Figure 1.