R version: R Under development (unstable) (2025-03-13 r87965)
Bioconductor version: 3.21
Package version: 2.27.0

1 Introduction

In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed.

The data set used (1) is from a paper studying the differences in gene expression in inflamed and non-inflamed tissue. 14 patients suffering from Ulcerative colitis (UC) and 15 patients with Crohn’s disease (CD) were tested, and from each patient inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy. This is a typical clinical data set consisting of 58 arrays in total. Our aim is to analyze differential expression (DE) between the tissues. Our results show a substantial overlap with the results of the original paper.

2 Workflow package installation from Bioconductor

The workflow is wrapped in a package called maEndToEnd.
You can install the package via the BiocManager.

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("maEndToEnd")

Currently, the workflow has been available since Bioconductor version 3.8, which was first released in October 2018.

2.1 Workflow package import

Once the workflow package has been successfully installed, we can use a call to library() in order to load it. This will also load all the other packages neccessary to run the workflow.

suppressPackageStartupMessages({library("maEndToEnd")})
   
   groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

2.2 List of packages required for the workflow

Below, you find a list of packages that are required by the workflow. Some Helper/Styling packages have been commented here, as they are not strictly neccesary to execute the workflow.

#General Bioconductor packages
    library(Biobase)
    library(oligoClasses)
     
#Annotation and data import packages
    library(ArrayExpress)
    library(pd.hugene.1.0.st.v1)
    library(hugene10sttranscriptcluster.db)
     
#Quality control and pre-processing packages
    library(oligo)
    library(arrayQualityMetrics)
     
#Analysis and statistics packages
    library(limma)
    library(topGO)
    library(ReactomePA)
    library(clusterProfiler)
     
#Plotting and color options packages
    library(gplots)
    library(ggplot2)
    library(geneplotter)
    library(RColorBrewer)
    library(pheatmap)
    library(enrichplot)
     
#Formatting/documentation packages
   #library(rmarkdown)
   #library(BiocStyle)
    library(dplyr)
    library(tidyr)

#Helpers:
    library(stringr)
    library(matrixStats)
    library(genefilter)
    library(openxlsx)
   #library(devtools)

3 Downloading the raw data from ArrayExpress

The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the measured probe intensities. The data we use have been deposited at ArrayExpress and have the accession code E-MTAB-2967.

We will store these files in the directory raw_data_dir which defaults to a cache location determined by R. Alternatively, you can provide any location you like here.

raw_data_dir <- tools::R_user_dir(package = "maEndToEnd", which = "cache")

if (!dir.exists(raw_data_dir)) {
    dir.create(raw_data_dir)
}

Each ArrayExpress data set has a landing page summarizing the data set, and we use the getAEfunction from the ArrayExpress Bioconductor package to obtain the ftp links to the raw data files (Data from Palmieri et. al. on ArrayEpress).

With the code below, we download the raw data (also including annotation data) from ArrayExpress (2) by using the getAE-function. The data are saved in the raw_data_dir created above. The names of the downloaded files are returned as a list.

anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")
download.file(url = "https://www.ebi.ac.uk/biostudies/files/E-MTAB-2967/E-MTAB-2967.sdrf.txt",
              destfile = file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt"), 
              mode = "wb")

We will now have a closer look at the data we downloaded from ArrayExpress

4 Background information on the data

4.1 Information stored in ArrayExpress

Each dataset at ArrayExpress is stored according to the MAGE-TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE-TAB format specifies up to five different types of files:

  • Investigation Description Format (IDF)
  • Array Design Format (ADF)
  • Sample and Data Relationship Format (SDRF)
  • raw data files
  • processed data files

Other than the raw data files, the IDF and the SDRF file are important for us. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.

Before we move on to the actual raw data import, we will briefly introduce the ExpressionSet class contained in the Biobase package. It is commonly used to store microarray data in Bioconductor.

4.2 Bioconductor ExpressionSets

Genomic data can be very complex, usually consisting of a number of different components, e.g. information on the experimental samples, annotation of genomic features measured as well as the experimental data itself. In Bioconductor, the approach is taken that these components should be stored in a single structure to easily manage the data.

The package Biobase contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information (i.e. as contained in the various MAGE-TAB files) into a single convenient structure. An ExpressionSet can be manipulated (e.g., subsetted, copied), and is the input to or output of many Bioconductor functions.

The data in an ExpressionSet consist of:

  • assayData: Expression data from microarray experiments with microarray probes in rows and sample identifiers in columns
  • metaData
    • phenoData: A description of the samples in the experiment with sample identifiers in rows and description elements in columns; holds the content of the SDRF file
    • featureData: metadata about the features on the chip
      or technology used for the experiment with same rows as assayData by default and freely assignable columns
    • further annotations for the features, for example gene annotations from biomedical databases (annotation).
  • experimentData: A flexible structure to describe the experiment.

The ExpressionSet class coordinates all of these data, so that one does not have to worry about the details. However, one should keep in mind that the rownames of the phenoData have to match the column names of the assay data, while the row names of the assay data have to match the row names of the featureData. This is illustrated in Figure 1.

Structure of Bioconductor’s ExpressionSet class.

Figure 1: Structure of Bioconductor’s ExpressionSet class

You can use the functions pData and fData to extract the sample and feature annotation, respectively, from an ExpressionSet. The function exprs will return the expression data itself as a matrix.

5 Import of annotation data and microarray expression data as “ExpressionSet”

We import the SDRF file with the read.delim function from the raw data folder in order to obtain the sample annotation.

The sample names are given in the column Array.Data.File of the SDRF data table and will be used as rownames for the SDRF file.

We turn the SDRF table into an AnnotatedDataFrame from the Biobase package that we will need later to create an ExpressionSet for our data (3).

sdrf_location <- file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt")
SDRF <- read.delim(sdrf_location)

rownames(SDRF) <- SDRF$Array.Data.File
SDRF <- AnnotatedDataFrame(SDRF)

We now create the Expression Set object raw_data, which contains array data, pheno data (from the SDRF file) as well as the information of the chip annotation package used.

The analysis of Affymetrix arrays starts with CEL files. These are the result of the processing of the raw image files using the Affymetrix software and contain estimated probe intensity values. Each CEL file additionally contains some metadata, such as a chip identifier.

We use the function read.celfiles from the oligo package (4) to import the files:

raw_data <- oligo::read.celfiles(filenames = file.path(raw_data_dir, 
                                                SDRF$Array.Data.File),
                                    verbose = FALSE, phenoData = SDRF)
stopifnot(validObject(raw_data))

This automatically creates an ExpressionSet, fills the sections “array data” with the data from the CEL files and uses the correct chip annotation package, in this case pd.hugene.1.0.st.v1 (the chip-type is also stored in the .CEL files).

Furthermore, we specified our AnnotatedDataFrameSDRF” created earlier from the SDRF file as phenoData. Thus, we had to make sure to import the CEL files in the order that corresponds to the SDRF table — to enforce this, we used the column Array.Data.File of the SDRF table as the filenames argument.

Finally, we checked whether the object created is valid (e.g. sample names match between the different tables).

We now have a first look on the raw data.

The pData function of the Biobase package directly accesses the phenoData in the ExpressionSet raw_data. With the head() function, we can view the first six lines of the table. We have a look at the columns included and retain only those columns that are related to the experimental factors of interest.

head(Biobase::pData(raw_data))
               Source.Name Characteristics.individual. Characteristics.organism.
   164_I_.CEL        164_I                         164              Homo sapiens
   164_II.CEL       164_II                         164              Homo sapiens
   183_I.CEL         183_I                         183              Homo sapiens
   183_II.CEL       183_II                         183              Homo sapiens
   2114_I.CEL       2114_I                        2114              Homo sapiens
   2114_II.CEL     2114_II                        2114              Homo sapiens
               Characteristics.disease. Characteristics.organism.part.
   164_I_.CEL           Crohn's disease                          colon
   164_II.CEL           Crohn's disease                          colon
   183_I.CEL            Crohn's disease                          colon
   183_II.CEL           Crohn's disease                          colon
   2114_I.CEL           Crohn's disease                          colon
   2114_II.CEL          Crohn's disease                          colon
                Characteristics.phenotype. Material.Type Protocol.REF
   164_I_.CEL  non-inflamed colonic mucosa organism part P-MTAB-41361
   164_II.CEL      inflamed colonic mucosa organism part P-MTAB-41361
   183_I.CEL   non-inflamed colonic mucosa organism part P-MTAB-41361
   183_II.CEL      inflamed colonic mucosa organism part P-MTAB-41361
   2114_I.CEL  non-inflamed colonic mucosa organism part P-MTAB-41361
   2114_II.CEL     inflamed colonic mucosa organism part P-MTAB-41361
               Protocol.REF.1 Extract.Name Protocol.REF.2 Labeled.Extract.Name
   164_I_.CEL    P-MTAB-41363        164_I   P-MTAB-41364         164_I:Biotin
   164_II.CEL    P-MTAB-41363       164_II   P-MTAB-41364        164_II:Biotin
   183_I.CEL     P-MTAB-41363        183_I   P-MTAB-41364         183_I:Biotin
   183_II.CEL    P-MTAB-41363       183_II   P-MTAB-41364        183_II:Biotin
   2114_I.CEL    P-MTAB-41363       2114_I   P-MTAB-41364        2114_I:Biotin
   2114_II.CEL   P-MTAB-41363      2114_II   P-MTAB-41364       2114_II:Biotin
                 Label Protocol.REF.3 Assay.Name Technology.Type Array.Design.REF
   164_I_.CEL  biotin    P-MTAB-41366     164_I_     array assay       A-AFFY-141
   164_II.CEL  biotin    P-MTAB-41366     164_II     array assay       A-AFFY-141
   183_I.CEL   biotin    P-MTAB-41366      183_I     array assay       A-AFFY-141
   183_II.CEL  biotin    P-MTAB-41366     183_II     array assay       A-AFFY-141
   2114_I.CEL  biotin    P-MTAB-41366     2114_I     array assay       A-AFFY-141
   2114_II.CEL biotin    P-MTAB-41366    2114_II     array assay       A-AFFY-141
               Term.Source.REF Protocol.REF.4 Array.Data.File
   164_I_.CEL     ArrayExpress   P-MTAB-41367      164_I_.CEL
   164_II.CEL     ArrayExpress   P-MTAB-41367      164_II.CEL
   183_I.CEL      ArrayExpress   P-MTAB-41367       183_I.CEL
   183_II.CEL     ArrayExpress   P-MTAB-41367      183_II.CEL
   2114_I.CEL     ArrayExpress   P-MTAB-41367      2114_I.CEL
   2114_II.CEL    ArrayExpress   P-MTAB-41367     2114_II.CEL
                                                                                   Comment..ArrayExpress.FTP.file.
   164_I_.CEL  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
   164_II.CEL  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
   183_I.CEL   ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
   183_II.CEL  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
   2114_I.CEL  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
   2114_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip
               Factor.Value.disease.     Factor.Value.phenotype.
   164_I_.CEL        Crohn's disease non-inflamed colonic mucosa
   164_II.CEL        Crohn's disease     inflamed colonic mucosa
   183_I.CEL         Crohn's disease non-inflamed colonic mucosa
   183_II.CEL        Crohn's disease     inflamed colonic mucosa
   2114_I.CEL        Crohn's disease non-inflamed colonic mucosa
   2114_II.CEL       Crohn's disease     inflamed colonic mucosa

The columns of interest for us are the following:

  • identifiers of the individuals, i.e. columns “Source.Name”, “Characteristics.individual.”
  • disease of the individual, i.e. “Factor.Value.disease.”
  • mucosa type, i.e. “Factor.Value.phenotype.”

We now subselect the corresponding columns:

Biobase::pData(raw_data) <- Biobase::pData(raw_data)[, c("Source.Name",
                                     "Characteristics.individual.",
                                     "Factor.Value.disease.",
                                     "Factor.Value.phenotype.")]

6 Quality control of the raw data

The first step after the initial data import is the quality control of the data. Here we check for outliers and try to see whether the data clusters as expected, e.g. by the experimental conditions. The expression intensity values are in the assayData sub-object “exprs” and can be accessed by the exprs(raw_data) function. The rows represent the microarray probes, i.e.  the single DNA locations on the chip, while the columns represent one microarray, i.e. a sample of inflamed and non-inflamed tissue of every patient, respectively.

Biobase::exprs(raw_data)[1:5, 1:5]
     164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL
   1       4496       5310      4492       4511       2872
   2        181        280       137        101         91
   3       4556       5104      4379       4608       2972
   4        167        217        99         79         82
   5         89        110        69         58         47

For quality control, we take the log2 of Biobase::exprs(raw_data), as expression data is commonly analyzed on a logarithmic scale.

We then perform a principal component analysis (PCA) and plot it (Figure 2). Every point in the plot represents one sample, with the colour indicating the mucosa type (inflamed vs non-inflamed) and the shape indicating the disease (UC or CD).

exp_raw <- log2(Biobase::exprs(raw_data))
PCA_raw <- prcomp(t(exp_raw), scale. = FALSE)

percentVar <- round(100*PCA_raw$sdev^2/sum(PCA_raw$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])

dataGG <- data.frame(PC1 = PCA_raw$x[,1], PC2 = PCA_raw$x[,2],
                    Disease = pData(raw_data)$Factor.Value.disease.,
                    Phenotype = pData(raw_data)$Factor.Value.phenotype.,
                    Individual = pData(raw_data)$Characteristics.individual.)

ggplot(dataGG, aes(PC1, PC2)) +
      geom_point(aes(shape = Disease, colour = Phenotype)) +
  ggtitle("PCA plot of the log-transformed raw expression data") +
  xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
  ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
  theme(plot.title = element_text(hjust = 0.5))+
  coord_fixed(ratio = sd_ratio) +
  scale_shape_manual(values = c(4,15)) + 
  scale_color_manual(values = c("darkorange2", "dodgerblue4"))
PCA plot of the log–transformed raw expression data.

Figure 2: PCA plot of the log–transformed raw expression data

The PCA plot (Figure 2, performed on the log-intensity scale) of the raw data shows that the first principal component differentiates between the diseases. This means that the disease type is a major driver of gene expression differences. This might hinder our analysis, as we want to analyze the differential expression between inflamed and non-inflamed tissues, independently of the disease a person suffers from.

We also represent the probe intensities via a boxplot graph with one box per individual microarray. (Figure 3). Note that the oligo::boxplot function, i.e. the boxplot function of the oligo package, can take expression sets as argument. It accesses the expression data and performs a log2-transformation by default. We therefore can use raw_data as argument here.

oligo::boxplot(raw_data, target = "core", 
               main = "Boxplot of log2-intensitites for the raw data")
Intensity boxplots of the log2–transformed raw data.

Figure 3: Intensity boxplots of the log2–transformed raw data

When looking at the boxplot (Figure 3), we see that the intensity distributions of the individual arrays are quite different, indicating the need for an appropriate normalization, which we will discuss next.

Until now, we have only performed a very basic quality control; more elaborate quality control plots are available in the package arrayQualityMetrics (5). The package produces an html report, containing the quality control plots together with a description of their aims and an identification of possible outliers. We do not discuss this tool in detail here, but simply provide the code below, which creates a report for our raw data.

arrayQualityMetrics(expressionset = raw_data,
    outdir = tempdir(),
    force = TRUE, do.logtransform = TRUE,
    intgroup = c("Factor.Value.disease.", "Factor.Value.phenotype."))

7 Background adjustment, calibration, summarization and annotation

7.1 Background adjustment

After the initial import and quality assessment, the next step in processing of microarray data is background adjustment. This is essential because a proportion of the measured probe intensities are due to non-specific hybridization and the noise in the optical detection system. Therefore, observed intensities need to be adjusted to give accurate measurements of specific hybridization.

7.2 Across-array normalization (calibration)

Normalization across arrays is needed in order to be able to compare measurements from different array hybridizations due to many obscuring sources of variation. These include different efficiencies of reverse transcription, labeling or hybridization reactions, physical problems with the arrays, reagent batch effects, and laboratory conditions.

7.3 Summarization

After normalization, summarization is necessary to be done because on the Affymetrix platform, transcripts are represented by multiple probes, that is multiple locations on the array. For each gene, the background-adjusted and normalized intensities of all probes need to be summarized into one quantity that estimates an amount proportional to the amount of RNA transcript.

After the summarization step, the summarized data can be annotated with various information, e.g. gene symbols and ENSEMBL gene identifiers. There is an annotation database available from Bioconductor for our platform, namely the package hugene10sttranscriptcluster.db.

You can view its content like this:

head(ls("package:hugene10sttranscriptcluster.db"))
   [1] "hugene10sttranscriptcluster"           
   [2] "hugene10sttranscriptcluster.db"        
   [3] "hugene10sttranscriptclusterACCNUM"     
   [4] "hugene10sttranscriptclusterALIAS2PROBE"
   [5] "hugene10sttranscriptclusterCHR"        
   [6] "hugene10sttranscriptclusterCHRLENGTHS"

Additional information is available from the reference manual of the package. Essentially, the package provides a mapping from the transcript cluster identifiers to the various annotation data.

7.4 Old and new “probesets” of Affymetrix microarrays

Traditionally, Affymetrix arrays (the so-called 3’ IVT arrays) were probeset based: a certain fixed group of probes were part of a probeset which represented a certain gene or transcript (note however, that a gene can be represented by multiple probesets).

The more recent “Gene” and “Exon” Affymetrix arrays are exon based and hence there are two levels of summarization to get to the gene level. The “probeset” summarization leads to the exon level. The gene / transcript level is given by “transcript clusters”. Hence, the appropriate annotation package for our chip type is called hugene10sttranscriptcluster.db.

“Gene” arrays were created as affordable versions of the “Exon” arrays, by only taking the “good” probes from the Exon array. Initially on the Exon array, at least four probes were part of one “Exon”. With the thinned out “Gene” array, many probesets were made up of three or fewer probes. This is visualized in Figure 4: Single probesets are indicated by single colours; probesets representing one gene are indicated by a colour shade: e.g., all yellow probes belong to one Exon, and all yellow, orange and red probesets belong to one gene:

<img src=“/tmp/Rtmp1Mxwge/Rbuildc81e96654f98f/maEndToEnd/vignettes/MA-Workflow_files/figure-html/DifferenceBetweenExonAndGeneTypeArrays-1.png” alt=“Visualization of the difference between”Exon" type array (left) and “Gene” type array (right)." width=“100%” />

Figure 4: Visualization of the difference between “Exon” type array (left) and “Gene” type array (right)

On the left side, we see plenty of probes for each Exon / probeset (i.e. each colour): therefore, a summarization on the probeset / exon level makes sense. In the gene type array, however, only a small proportion of the original probes per probeset is included. Thus, a summarization on the probeset / exon level is not recommended for “Gene” arrays but nonetheless possible by using the hugene10stprobeset.db annotation package.

Note that furthermore, there are also no longer designated match/mismatch probes present on “Gene” and “Exon” type chips. The mismatch probe was initially intended as base-level for background correction, but hasn’t prevailed due to more elaborate background correction techniques that do not require a mismatch probe.

7.5 One-step preprocessing in oligo

The package oligo allows us to perform background correction, normalization and summarization in one single step using a deconvolution method for background correction, quantile normalization and the RMA (robust multichip average) algorithm for summarization.

This series of steps as a whole is commonly referred to as RMA algorithm, although strictly speaking RMA is merely a summarization method (68).

8 Relative Log Expression data quality analysis

Before calibrating and evaluating the data, we want to perform another quality control procedure, namely Relative Log Expression (RLE), as described in the article by Gandolfo et al (9). To this end, we first perform an RMA without prior normalization:

palmieri_eset <- oligo::rma(raw_data, target = "core", normalize = FALSE)
   Background correcting
   Calculating Expression

Further details on the RMA algorithm will be provided after RLE analysis, when the “full” RMA is carried out, including normalization.

8.1 Computing the RLE

The RLE is performed by calculating the median log2 intensity of every transcript across all arrays.

We do this by calculating the row medians of exprs(palmieri_eset), as the transcripts are represented by the rows and the single microarrays by the columns.

Note that we do not have to apply the log2 manually, as the output data of the RMA function is in log2 scale by default.

We then substract this transcript median intensity from every transcript intensity via the sweep function.

8.2 Plotting the RLE

We finally reshape the data into a format in which we can use to create a boxplot for each array, as before:

row_medians_assayData <- 
  Biobase::rowMedians(as.matrix(Biobase::exprs(palmieri_eset)))

RLE_data <- sweep(Biobase::exprs(palmieri_eset), 1, row_medians_assayData)

RLE_data <- as.data.frame(RLE_data)
RLE_data_gathered <- 
  tidyr::gather(RLE_data, patient_array, log2_expression_deviation)

ggplot2::ggplot(RLE_data_gathered, aes(patient_array,
                                       log2_expression_deviation)) + 
  geom_boxplot(outlier.shape = NA) + 
  ylim(c(-2, 2)) + 
  theme(axis.text.x = element_text(colour = "aquamarine4", 
                                  angle = 60, size = 6.5, hjust = 1 ,
                                  face = "bold"))
Boxplot for the RLE values

Figure 5: Boxplot for the RLE values

Note that the y-axis now displays for each microarray the deviation of expression intensity from the median expression of the respective single transcripts across arrays.

Boxes with a larger extension therefore indicate an unusually high deviation from the median in a lot of transcripts, suggesting that these arrays are different from most of the others in some way.

Boxes that are shifted in y-direction indicate a systematically higher or lower expression of the majority of transcripts in comparison to most of the other arrays. This could be caused by quality issues or batch effects.

Therefore, if shape and median of a given box varies too much from the bulk, they should be inspected and potentially removed.

By inspecting the boxplot in Figure 5, five arrays could be considered as outliers: 2826_I, 2826_II, 3262_II, 3302_II and 3332_II are negatively y-shifted.

We will keep these samples in mind for heatmap cluster analysis later on in the workflow. Arrays that are confirmed to be outliers by heatmap analysis could be removed for subsequent analysis.

9 RMA calibration of the data

Now, we can apply the full RMA algorithm to our data in order to background-correct, normalize and summarize:

palmieri_eset_norm <- oligo::rma(raw_data, target = "core")
   Background correcting
   Normalizing
   Calculating Expression

The parameter target defines the degree of summarization, the default option of which is “core”, using transcript clusters containing “safely” annotated genes. For summarization on the exon level (not recommended for Gene arrays), one can use “probeset” as the target option. Although other methods for background correction and normalization exist, RMA is usually a good default choice. RMA shares information across arrays and uses the versatile quantile normalization method that will make the array intensity distributions match. However, it is preferable to apply it only after outliers have been removed. The quantile normalization algorithm used by RMA works by replacing values by the average of identically ranked (within a single chip) values across arrays. A more detailed description can be found on the Wikipedia page.

An alternative to quantile normalization would be the vsn algorithm,that performs background correction and normalization by robustly shifting and scaling intensity values within arrays before log-transforming them. This is less “severe” than quantile normalization (10).

9.1 Some mathematical background on normalization (calibration) and background correction

A generic model for the value of the intensity \(Y\) of a single probe on a microarray is given by

\[ Y = B + \alpha \cdot S \]

where B is a random quantity due to background noise, usually composed of optical effects and non-specific binding, \(\alpha\) is a gain factor, and \(S\) is the amount of measured specific binding. The signal \(S\) is considered a random variable as well and accounts for measurement error and probe effects. The measurement error is typically assumed to be multiplicative so we can write:

\[ \log(S) = \theta + \varphi + \varepsilon \]

Here \(\theta\) represents the logarithm of the true abundance, \(\varphi\) is a probe-specific effect, and \(\varepsilon\) accounts for the nonspecific error. This is the additive-multiplicative-error model for microarray data used by RMA and also the vsn algorithm (10). The algorithms differ in the way that \(B\) is removed and an estimate of \(\theta\) is obtained.

9.2 Quality assessment of the calibrated data

We now produce a clustering heatmap and another PCA plot using the calibrated data.

PCA analysis

First, we perform a PCA analysis of the calibrated data analogously to the one with the raw data:

exp_palmieri <- Biobase::exprs(palmieri_eset_norm)
PCA <- prcomp(t(exp_palmieri), scale = FALSE)

percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
sd_ratio <- sqrt(percentVar[2] / percentVar[1])

dataGG <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
                    Disease = 
                     Biobase::pData(palmieri_eset_norm)$Factor.Value.disease.,
                    Phenotype = 
                     Biobase::pData(palmieri_eset_norm)$Factor.Value.phenotype.)


ggplot(dataGG, aes(PC1, PC2)) +
      geom_point(aes(shape = Disease, colour = Phenotype)) +
  ggtitle("PCA plot of the calibrated, summarized data") +
  xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +
  ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_fixed(ratio = sd_ratio) +
  scale_shape_manual(values = c(4,15)) + 
  scale_color_manual(values = c("darkorange2", "dodgerblue4"))
PCA plot of the calibrated, summarized data.

Figure 6: PCA plot of the calibrated, summarized data

In comparison to the first PCA analysis before RMA (Figure 2), we see that now the first principal component separates between the tissues types (Figure 6). This indicates that now differential expression between the tissue types is the dominant source of variation. Note that the second principal component separates the diseases.

Heatmap clustering analysis

We want to plot a heatmap with the sample-to-sample distances with the sample names as row-names. Also, we want to see how well the samples cluster for phenotype (inflamed and non-inflamed tissue) and disease (UC and CD), respectively. We use row annotation for that: it means that these features get a colour code and will be displayed to the left of each row.

phenotype_names <- ifelse(str_detect(pData
                                    (palmieri_eset_norm)$Factor.Value.phenotype.,
                             "non"), "non_infl.", "infl.")

disease_names <- ifelse(str_detect(pData
                                    (palmieri_eset_norm)$Factor.Value.disease.,
                             "Crohn"), "CD", "UC")

annotation_for_heatmap <- 
  data.frame(Phenotype = phenotype_names,  Disease = disease_names)

row.names(annotation_for_heatmap) <- row.names(pData(palmieri_eset_norm))

In order to map the sample-to-sample distances, we first compute the distances using the dist function. We need to transpose the expression values since the function computes the distances between the rows (i.e. genes in our case) by default. The default distance is the Euclidean one. However this can be changed and we choose the Manhattan distance here (it uses absolute distances along rectangular paths instead of squared distances of the direct path), as it is more robust. We set the diagonal of the distance matrix to NA in order to increase the contrast of the color coding. Those diagonal entries do not contain information since the distance of a sample to itself is always equal to zero.

dists <- as.matrix(dist(t(exp_palmieri), method = "manhattan"))

rownames(dists) <- row.names(pData(palmieri_eset_norm))
hmcol <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd"))(255))
colnames(dists) <- NULL
diag(dists) <- NA

ann_colors <- list(
  Phenotype = c(non_infl. = "chartreuse4", infl. = "burlywood3"),
  Disease = c(CD = "blue4", UC = "cadetblue2")
                   )
pheatmap(dists, col = (hmcol), 
         annotation_row = annotation_for_heatmap,
         annotation_colors = ann_colors,
         legend = TRUE, 
         treeheight_row = 0,
         legend_breaks = c(min(dists, na.rm = TRUE), 
                         max(dists, na.rm = TRUE)), 
         legend_labels = (c("small distance", "large distance")),
         main = "Clustering heatmap for the calibrated samples")