1 Introduction

Experiments comprising different omic data analyses are becoming common. In this cases, the researchers are interested in analyzing the different omic types independently as well as finding association between the different layers.

In this vignette, we present how address to an analysis of methylation and expression data using MEAL. We will show how to analyze each data type independently and how to integrate the results of both analyses. We will also exemplify how to run an analysis when we are interested in our target region.

We will use the data from brgedata package. This package contains data from a spanish children cohort. We will work with the methylation and the expression data encapsulated in a GenomicRatioSet and in an ExpressionSet respectively. In our analysis, we will evaluate the effect of sex on methylation and expression. We will also deeply explore changes in a region of chromosome 11 around MMP3, a gene differently expressed in men and women (chr11:102600000-103300000).

Let’s start by loading the required packages and data:

library(MEAL)
library(brgedata)
library(MultiDataSet)
library(missMethyl)
library(minfi)
library(GenomicRanges)
library(ggplot2)

From the six objects of brgedata, we will focus on those containing methylation and expression data: brge_methy and brge_gexp.

brge_methy is a GenomicRatioSet with methylation data corresponding to the Illumina Human Methylation 450K. It contains 392277 probes and 20 samples. The object contains 9 phenotypic variables (age, sex and cell types proportions):

data(brge_methy)
brge_methy
## class: GenomicRatioSet 
## dim: 392277 20 
## metadata(0):
## assays(1): Beta
## rownames(392277): cg13869341 cg24669183 ... cg26251715 cg25640065
## rowData names(14): Forward_Sequence SourceSeq ...
##   Regulatory_Feature_Group DHS
## colnames(20): x0017 x0043 ... x0077 x0079
## colData names(9): age sex ... Mono Neu
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA
colData(brge_methy)
## DataFrame with 20 rows and 9 columns
##             age      sex           NK     Bcell       CD4T       CD8T
##       <numeric> <factor>    <numeric> <numeric>  <numeric>  <numeric>
## x0017         4   Female -5.81386e-19 0.1735078 0.20406869  0.1009741
## x0043         4   Female  1.64826e-03 0.1820172 0.16171272  0.1287722
## x0036         4   Male    1.13186e-02 0.1690173 0.15546368  0.1277417
## x0041         4   Female  8.50822e-03 0.0697158 0.00732789  0.0321861
## x0032         4   Male    0.00000e+00 0.1139780 0.22230399  0.0216090
## ...         ...      ...          ...       ...        ...        ...
## x0018         4   Female    0.0170284 0.0781698   0.112735 0.06796816
## x0057         4   Female    0.0000000 0.0797774   0.111072 0.00910489
## x0061         4   Female    0.0000000 0.1640266   0.224203 0.13125212
## x0077         4   Female    0.0000000 0.1122731   0.168056 0.07840593
## x0079         4   Female    0.0120148 0.0913650   0.205830 0.11475389
##               Eos      Mono       Neu
##         <numeric> <numeric> <numeric>
## x0017 0.00000e+00 0.0385654  0.490936
## x0043 0.00000e+00 0.0499542  0.491822
## x0036 0.00000e+00 0.1027105  0.459632
## x0041 0.00000e+00 0.0718280  0.807749
## x0032 2.60504e-18 0.0567246  0.575614
## ...           ...       ...       ...
## x0018 8.37770e-03 0.0579535  0.658600
## x0057 4.61887e-18 0.1015260  0.686010
## x0061 8.51074e-03 0.0382647  0.429575
## x0077 6.13397e-02 0.0583411  0.515284
## x0079 0.00000e+00 0.0750535  0.513597

brge_gexp is an ExpressionSet with the expression data corresponding to an Affimetrix GeneChip Human Gene 2.0 ST Array. It contains 67528 features and 100 samples as well as two phenotypic variables (age and sex):

data(brge_gexp)
brge_gexp
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 67528 features, 100 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: x0001 x0002 ... x0139 (100 total)
##   varLabels: age sex
##   varMetadata: labelDescription
## featureData
##   featureNames: TC01000001.hg.1 TC01000002.hg.1 ...
##     TCUn_gl000247000001.hg.1 (67528 total)
##   fvarLabels: transcript_cluster_id probeset_id ... notes (11 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
lapply(pData(brge_gexp), table)
## $age
## 
##  4 
## 99 
## 
## $sex
## 
## Female   Male 
##     48     51

Finally, we will create a GenomicRanges with our target region:

targetRange <- GRanges("chr11:102600000-103300000")

Let’s illustrate the use of this package by analyzing the effect of the sex in methylation and expression. First, a genome wide analysis will be performed and then the region funcionalities will be introduced. It should be noticed that methylation and expression analyses will include hypothesis testing and visualitzation.

2 Methylation Analysis

2.1 Running the analyses

This demonstration will show how to perform a probe methylation analysis. To run a more complete analysis, including DMR methods, see MEAL vignette.

The function runDiffMeanAnalysis detects probes differentially methylated using limma. This function accepts a GenomicRatioSet, an ExpressionSet or a SummarizedExperiment, so it can analyze methylation and gene expression data. We can pass our model to function runDiffMeanAnalysis with the parameter model. We are interested in evaluating the effect of sex and We will include cell counts as covariables to adjust their effect:

cellCounts <- colnames(colData(brge_methy))[3:9]
methRes <- runDiffMeanAnalysis(set = brge_methy, 
                       model = formula(paste("~ sex +", paste(cellCounts, collapse = "+"))))
methRes
## Object of class 'ResultSet'
##  . created with: DAProbe 
##  . sva:   
##  . #results: 1 ( error: 1 )
##  . featureData: 392277 probes x 19 variables
names(methRes)
## [1] "DiffMean"

The analysis generates a ResultSet object containing the results of our analysis. The first step after running the pipeline will be evaluate how this model fits the data. A common way to evaluate the goodness of fit of these models is by making a QQplot on the p-values of the linear regression. This plot compares the distribution of p-values obtained in the analysis with a theoretical distribution. Our plot also shows the lambda, a measure of inflation. If lambda is higher than 1, the p-values are smaller than expected. If it is lower than 1, p-values are bigger than expected. In both causes, the most common cause is that we should include more variables in the model.

We can get a QQplot with the function plot. When applied to a ResultSet, we can make different plots depending on the method. We can select the name of the result with the parameter rid and the kind of plot with the parameter type. We will set rid to “DiffMean” to use limma results and type to “qq” to get a QQplot:

plot(methRes, rid = "DiffMean", type = "qq")

When there are few CpGs modicated, we expect points in the QQplot lying on the theoretical line. However, as all the CpGs from the sexual chromosomes will be differentially methylated between males and females, we obtained a S-shape. In this case, lambda is lower than 1 (0.851).

There are different ways to address this issue. An option is to include other covariates in our model. When we do not have more covariates, as it is our case, we can apply Surrogate Variable Analysis (SVA) to the data. SVA is a statistical technique that tries to determine hidden covariates or SV (Surrogate Variables) based on the measurements. This method is very useful to correct inflation when we do not have the variables that are missing in the model. Our function runPipeline includes the parameter sva that runs SVA and includes these variables as covariates in the linear model. More information can be found in MEAL vignette.

We can get an overview of the results using a Manhattan plot. A Manhattan plot represents the p-value of each CpG versus their location in the genome. This representation is useful to find genomic regions exhibiting differentiated behaviours. We can get a Manhattan plot from a ResultSet using the function plot and setting type to “manhattan”. We will highlight the CpGs of our target region by passing a GenomicRanges to highlight argument. GenomicRanges passed to plot should have chromosomes coded as number:

targetRangeNum <- GRanges("11:102600000-103300000")
plot(methRes, rid = "DiffMean", type = "manhattan", main = "Differences in Means", highlight = targetRangeNum)