Contents

1 Introduction

The missMethyl package contains functions to analyse methylation data from Illumina’s HumanMethylation450 and MethylationEPIC beadchip. These arrays are a cost-effective alternative to whole genome bisulphite sequencing, and as such are widely used to profile DNA methylation. Specifically, missMethyl contains functions to perform SWAN normalisation (Maksimovic, Gordon, and Oshlack 2012),perform differential methylation analysis using RUVm (Maksimovic et al. 2015), differential variability analysis (Phipson and Oshlack 2014) and gene set enrichment analysis (Phipson, Maksimovic, and Oshlack 2016). As our lab’s research into specialised analyses of these arrays continues we anticipate that the package will be updated with new functions.

Raw data files are in IDAT format, which can be read into R using the minfi package (Aryee et al. 2014). Statistical analyses are usually performed on M-values, and \(\beta\) values are used for visualisation, both of which can be extracted from MethylSet data objects, which is a class of object created by minfi. For detecting differentially variable CpGs we recommend that the analysis is performed on M-values. All analyses described here are performed at the CpG site level.

1.1 Analysis of Illumina MethylationEPIC v2.0 beadchip datasets

The missMethyl package has been updated to support Illumina’s MethylationEPIC v2.0 beadchip. This array contains replicate probes which map to the same CpG dinucleotides. We recommend filtering these replicates prior to analysis with the missMethyl package using the DMRcate package. An example of this is shown in this vignette.

2 Reading data into R

We will use the data in the minfiData package to demonstrate the functions in missMethyl. The example dataset has 6 samples across two slides. There are 3 cancer samples and 3 normal sample. The sample information is in the targets file. An essential column in the targets file is the Basename column which tells us where the idat files to be read in are located. The R commands to read in the data are taken from the minfi User’s Guide. For additional details on how to read the IDAT files into R, as well as information regarding quality control please refer to the minfi User’s Guide.

library(missMethyl)
library(limma)
library(minfi)
library(minfiData)
baseDir <- system.file("extdata", package = "minfiData")
targets <- read.metharray.sheet(baseDir)
## [1] "/home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/SampleSheet.csv"
targets[,1:9]
##   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1    GroupA_3          H5         <NA>       GroupA    <NA>    id3  83   M
## 2    GroupA_2          D5         <NA>       GroupA    <NA>    id2  58   F
## 3    GroupB_3          C6         <NA>       GroupB    <NA>    id3  83   M
## 4    GroupB_1          F7         <NA>       GroupB    <NA>    id1  75   F
## 5    GroupA_1          G7         <NA>       GroupA    <NA>    id1  75   F
## 6    GroupB_2          H7         <NA>       GroupB    <NA>    id2  58   F
##   status
## 1 normal
## 2 normal
## 3 cancer
## 4 cancer
## 5 normal
## 6 cancer
targets[,10:12]
##    Array      Slide
## 1 R02C02 5723646052
## 2 R04C01 5723646052
## 3 R05C02 5723646052
## 4 R04C02 5723646053
## 5 R05C02 5723646053
## 6 R06C02 5723646053
##                                                                                      Basename
## 1 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /home/biocbuild/bbs-3.20-bioc/R/site-library/minfiData/extdata/5723646053/5723646053_R06C02
rgSet <- read.metharray.exp(targets = targets)

The data is now an RGChannelSet object and needs to be normalised and converted to a MethylSet object.

2.1 Annotation for Illumina MethylationEPIC v2.0 beadchip datasets

When using MethylationEPIC v2.0 datasets, the array and annotation details in the RGChannelSet object must be set manually.

### This code is not run within this vignette

# If using EPIC_V2, please run these lines:
annotation(rgSet)["array"] = "IlluminaHumanMethylationEPICv2"
annotation(rgSet)["annotation"] = "20a1.hg38"

### End of not run

3 Subset-quantile within array normalization (SWAN)

SWAN (subset-quantile within array normalization) is a within-array normalization method for Illumina 450k & EPIC BeadChips. Technical differencs have been demonstrated to exist between the Infinium I and Infinium II assays on a single Illumina HumanMethylation array (Bibikova et al. 2011, @Dedeurwaerder2011). Using the SWAN method substantially reduces the technical variability between the assay designs whilst maintaining important biological differences. The SWAN method makes the assumption that the number of CpGs within the 50bp probe sequence reflects the underlying biology of the region being interrogated. Hence, the overall distribution of intensities of probes with the same number of CpGs in the probe body should be the same regardless of assay type. The method then uses a subset quantile normalization approach to adjust the intensities of each array (Maksimovic, Gordon, and Oshlack 2012).

SWAN can take a MethylSet, RGChannelSet or MethyLumiSet as input. It should be noted that, in order to create the normalization subset, SWAN randomly selects Infinium I and II probes that have one, two and three underlying CpGs; as such, we recommend using set.seed before to ensure that the normalized intensities will be identical, if the normalization is repeated.

The technical differences between Infinium I and II assay designs can result in aberrant \(\beta\) value distributions (Figure 1, panel “Raw”). Using SWAN corrects for the technical differences between the Infinium I and II assay designs and produces a smoother overall \(\beta\) value distribution (Figure 1, panel “SWAN”).

mSet <- preprocessRaw(rgSet)
mSetSw <- SWAN(mSet,verbose=TRUE)
## [SWAN] Preparing normalization subset
## 450k
## [SWAN] Normalizing methylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
## [SWAN] Normalizing unmethylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
par(mfrow=c(1,2), cex=1.25)
densityByProbeType(mSet[,1], main = "Raw")
densityByProbeType(mSetSw[,1], main = "SWAN")
Beta value dustributions. Density distributions of beta values before and after using SWAN.

Figure 1: Beta value dustributions
Density distributions of beta values before and after using SWAN.

4 Filter out poor quality probes

Poor quality probes can be filtered out based on the detection p-value. For this example, to retain a CpG for further analysis, we require that the detection p-value is less than 0.01 in all samples.

detP <- detectionP(rgSet)
keep <- rowSums(detP < 0.01) == ncol(rgSet)
mSetSw <- mSetSw[keep,]

5 Extracting \(\beta\) and M-values

Now that the data has been SWAN normalised we can extract \(\beta\) and M-values from the object. We prefer to add an offset to the methylated and unmethylated intensities when calculating M-values, hence we extract the methylated and unmethylated channels separately and perform our own calculation. For all subsequent analyses we use a random selection of 20000 CpGs to reduce computation time.

set.seed(10)
mset_reduced <- mSetSw[sample(1:nrow(mSetSw), 20000),]
meth <- getMeth(mset_reduced)
unmeth <- getUnmeth(mset_reduced)
Mval <- log2((meth + 100)/(unmeth + 100))
beta <- getBeta(mset_reduced)
dim(Mval)
## [1] 20000     6
par(mfrow=c(1,1))
plotMDS(Mval, labels=targets$Sample_Name, col=as.integer(factor(targets$status)))
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1.2,col=1:2)
MDS plot. A multi-dimensional scaling (MDS) plot of cancer and normal samples.

Figure 2: MDS plot
A multi-dimensional scaling (MDS) plot of cancer and normal samples.

An MDS plot (Figure 2) is a good sanity check to make sure samples cluster together according to the main factor of interest, in this case, cancer and normal.

5.1 Filtering replicate probes in Illumina MethylationEPIC v2.0 beadchip datasets

The MethylationEPIC v2.0 array contains replicate probes which map to the same CpG dinucleotides. We recommend filtering these replicate probes using the DMRcate package as shown below. Please see the [DMRcate](https://bioconductor.org/packages/3.20/DMRcate) EPICv2 vignette for more details.

### This code is not run within this vignette

# For EPIC_V2, please run these lines to remove replicate and non-cg probes:
# remove non-cg probes
Mval <- Mval[grepl("^cg", rownames(Mval)),]
# select replicate probes with best sensitivity
Mval <- DMRcate::rmPosReps(Mval, filter.strategy="sensitivity")

### End of not run

6 Testing for differential methylation using limma

To test for differential methylation we use the limma package (Smyth 2005), which employs an empirical Bayes framework based on Guassian model theory. First we need to set up the design matrix. There are a number of ways to do this, the most straightforward is directly from the targets file. There are a number of variables, with the status column indicating cancer/normal samples. From the person column of the targets file, we see that the cancer/normal samples are matched, with 3 individuals each contributing both a cancer and normal sample. Since the limma model framework can handle any experimental design which can be summarised by a design matrix, we can take into account the paired nature of the data in the analysis. For more complicated experimental designs, please refer to the limma User’s Guide.

group <- factor(targets$status,levels=c("normal","cancer"))
id <- factor(targets$person)
design <- model.matrix(~id + group)
design
##   (Intercept) idid2 idid3 groupcancer
## 1           1     0     1           0
## 2           1     1     0           0
## 3           1     0     1           1
## 4           1     0     0           1
## 5           1     0     0           0
## 6           1     1     0           1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$id
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

Now we can test for differential methylation using the lmFit and eBayes functions from limma. As input data we use the matrix of M-values.

fit.reduced <- lmFit(Mval,design)
fit.reduced <- eBayes(fit.reduced, robust=TRUE)

The numbers of hyper-methylated (1) and hypo-methylated (-1) can be displayed using the decideTests function in limma and the top 10 differentially methylated CpGs for cancer versus normal extracted using topTable.

summary(decideTests(fit.reduced))
##        (Intercept) idid2 idid3 groupcancer
## Down          6966     0   120         577
## NotSig        3366 20000 19875       18950
## Up            9668     0     5         473
top<-topTable(fit.reduced,coef=4)
top
##               logFC    AveExpr        t      P.Value  adj.P.Val        B
## cg16969623 4.738630 -1.3299003 18.19044 5.173550e-06 0.02961263 4.437708
## cg24115221 4.263797 -0.4610231 16.36669 9.020982e-06 0.02961263 4.071486
## cg13692446 4.358674  0.3082280 15.78961 1.089203e-05 0.02961263 3.939795
## cg11334818 4.034771  0.9277920 15.28520 1.291378e-05 0.02961263 3.817612
## cg10341556 4.978126 -1.5761985 15.42154 1.331473e-05 0.02961263 3.783887
## cg17815252 3.691483 -2.3570265 14.67851 1.596482e-05 0.02961263 3.661210
## cg24331301 3.847282 -1.4302450 14.47935 1.714739e-05 0.02961263 3.607480
## cg26976732 3.848104 -0.9274223 14.35266 1.795372e-05 0.02961263 3.572657
## cg26328335 3.726501  0.3548518 14.15063 1.933427e-05 0.02961263 3.516070
## cg05621343 4.271503 -0.5600706 14.17489 1.946740e-05 0.02961263 3.509129

Note that since we performed our analysis on M-values, the logFC and AveExpr columns are computed on the M-value scale. For interpretability and visualisation we can look at the \(\beta\) values. The \(\beta\) values for the top 4 differentially methylated CpGs shown in Figure 3.

cpgs <- rownames(top)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgs[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgs[i],cex.main=1.5)
}
Top DM CpGs. The beta values for the top 4 differentially methylated CpGs.

Figure 3: Top DM CpGs
The beta values for the top 4 differentially methylated CpGs.

7 Removing unwanted variation when testing for differential methylation

Like other platforms, 450k array studies are subject to unwanted technical variation such as batch effects and other, often unknown, sources of variation. The adverse effects of unwanted variation have been extensively documented in gene expression array studies and have been shown to be able to both reduce power to detect true differences and to increase the number of false discoveries. As such, when it is apparent that data is significantly affected by unwanted variation, it is advisable to perform an adjustment to mitigate its effects.

missMethyl provides a limma inspired interface to functions from the CRAN package ruv, which enable the removal of unwanted variation when performing a differential methylation analysis (Maksimovic et al. 2015).

RUVfit uses the RUV-inverse method by default, as this does not require the user to specify a \(k\) parameter. The ridged version of RUV-inverse is also available by setting method = rinv. The RUV-2 and RUV-4 functions can also be used by setting method = ruv2 or method = ruv4, respectively, and specifying an appropriate value for k (number of components of unwanted variation to remove) where \(0 \leq k < no. samples\).

All of the methods rely on negative control features to accurately estimate the components of unwanted variation. Negative control features are probes/genes/etc. that are known a priori to not truly be associated with the biological factor of interest, but are affected by unwanted variation. For example, in a microarray gene expression study, these could be house-keeping genes or a set of spike-in controls. Negative control features are extensively discussed in Gagnon-Bartsch and Speed (2012) and Gagnon-Bartsch et al. (2013). Once the unwanted factors are accurately estimated from the data, they are adjusted for in the linear model that describes the differential analysis.

If the negative control features are not known a priori, they can be identified empirically. This can be achieved via a 2-stage approach, RUVm. Stage 1 involves performing a differential methylation analysis using RUV-inverse (by default) and the 613 Illumina negative controls (INCs) as negative control features. This will produce a list of CpGs ranked by p-value according to their level of association with the factor of interest. This list can then be used to identify a set of empirical control probes (ECPs), which will capture more of the unwanted variation than using the INCs alone. ECPs are selected by designating a proportion of the CpGs least associated with the factor of interest as negative control features; this can be done based on either an FDR cut-off or by taking a fixed percentage of probes from the bottom of the ranked list. Stage 2 involves performing a second differential methylation analysis on the original data using RUV-inverse (by default) and the ECPs. For simplicity, we are ignoring the paired nature of the cancer and normal samples in this example.

# get M-values for ALL probes
meth <- getMeth(mSet)
unmeth <- getUnmeth(mSet)
M <- log2((meth + 100)/(unmeth + 100))
### This code is not run within this vignette

# For EPIC_V2, please run these lines to remove replicate and non-cg probes:
# remove non-cg probes
M <- M[grepl("^cg", rownames(M)),]
# select replicate probes with best sensitivity
M <- DMRcate::rmPosReps(M, filter.strategy="sensitivity")

### End of not run
# setup the factor of interest
grp <- factor(targets$status, labels=c(0,1))
# extract Illumina negative control data
INCs <- getINCs(rgSet)
head(INCs)
##          5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## 13792480        -0.3299654        -1.0955482        -0.5266103
## 69649505        -1.0354488        -1.4943396        -1.0067050
## 34772371        -1.1286422        -0.2995603        -0.8192636
## 28715352        -0.5553373        -0.7599489        -0.7186973
## 74737439        -1.1169178        -0.8656399        -0.6429681
## 33730459        -0.7714684        -0.5622424        -0.7724825
##          5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
## 13792480        -0.6374299         -1.116598        -0.4332793
## 69649505        -0.8854881         -1.586679        -0.9217329
## 34772371        -0.6895514         -1.161155        -0.6186795
## 28715352        -1.7903619         -1.348105        -1.0067259
## 74737439        -0.8872082         -1.064986        -0.9841833
## 33730459        -1.5623138         -2.079184        -1.0445246
# add negative control data to M-values
Mc <- rbind(M,INCs)
# create vector marking negative controls in data matrix
ctl1 <- rownames(Mc) %in% rownames(INCs)
table(ctl1)
## ctl1
##  FALSE   TRUE 
## 485512    613
rfit1 <- RUVfit(Y = Mc, X = grp, ctl = ctl1) # Stage 1 analysis
rfit2 <- RUVadj(Y = Mc, fit = rfit1)

Now that we have performed an initial differential methylation analysis to rank the CpGs with respect to their association with the factor of interest, we can designate the CpGs that are least associated with the factor of interest based on FDR-adjusted p-value as ECPs.

top1 <- topRUV(rfit2, num=Inf, p.BH = 1)
head(top1)
##                     F.p     F.p.BH       p_X1.1  p.BH_X1.1    b_X1.1     sigma2
## cg04743961 3.516091e-07 0.01017357 3.516091e-07 0.01017357 -4.838190 0.10749571
## cg07155336 3.583107e-07 0.01017357 3.583107e-07 0.01017357 -5.887409 0.16002329
## cg20925841 3.730375e-07 0.01017357 3.730375e-07 0.01017357 -4.790211 0.10714494
## cg03607359 4.721205e-07 0.01017357 4.721205e-07 0.01017357 -4.394397 0.09636129
## cg10566121 5.238865e-07 0.01017357 5.238865e-07 0.01017357 -4.787914 0.11780108
## cg07655636 6.080091e-07 0.01017357 6.080091e-07 0.01017357 -4.571758 0.11201883
##            var.b_X1.1 fit.ctl        mean
## cg04743961 0.07729156   FALSE -2.31731496
## cg07155336 0.11505994   FALSE -1.27676413
## cg20925841 0.07703935   FALSE -0.87168892
## cg03607359 0.06928569   FALSE -2.19187147
## cg10566121 0.08470133   FALSE  0.03138961
## cg07655636 0.08054378   FALSE -1.29294851
ctl2 <- rownames(M) %in% rownames(top1[top1$p.BH_X1.1 > 0.5,])
table(ctl2)
## ctl2
##  FALSE   TRUE 
## 172540 312972

We can then use the ECPs to perform a second differential methylation with RUV-inverse, which is adjusted for the unwanted variation estimated from the data.

# Perform RUV adjustment and fit
rfit3 <- RUVfit(Y = M, X = grp, ctl = ctl2) # Stage 2 analysis
rfit4 <- RUVadj(Y = M, fit = rfit3)
# Look at table of top results
topRUV(rfit4)
##              F.p F.p.BH p_X1.1 p.BH_X1.1    b_X1.1    sigma2 var.b_X1.1 fit.ctl
## cg07155336 1e-24  1e-24  1e-24     1e-24 -5.769286 0.1668414  0.1349771   FALSE
## cg06463958 1e-24  1e-24  1e-24     1e-24 -5.733093 0.1668414  0.1349771   FALSE
## cg00024472 1e-24  1e-24  1e-24     1e-24 -5.662959 0.1668414  0.1349771   FALSE
## cg02040433 1e-24  1e-24  1e-24     1e-24 -5.651399 0.1668414  0.1349771   FALSE
## cg13355248 1e-24  1e-24  1e-24     1e-24 -5.595396 0.1668414  0.1349771   FALSE
## cg02467990 1e-24  1e-24  1e-24     1e-24 -5.592707 0.1668414  0.1349771   FALSE
## cg00817367 1e-24  1e-24  1e-24     1e-24 -5.527501 0.1668414  0.1349771   FALSE
## cg11396157 1e-24  1e-24  1e-24     1e-24 -5.487992 0.1668414  0.1349771   FALSE
## cg16306898 1e-24  1e-24  1e-24     1e-24 -5.466780 0.1668414  0.1349771   FALSE
## cg03735888 1e-24  1e-24  1e-24     1e-24 -5.396242 0.1668414  0.1349771   FALSE
##                  mean
## cg07155336 -1.2767641
## cg06463958  0.2776252
## cg00024472 -2.4445762
## cg02040433 -1.2918259
## cg13355248 -0.8483387
## cg02467990 -0.4154370
## cg00817367 -0.2911294
## cg11396157 -1.3800170
## cg16306898 -2.1469768
## cg03735888 -1.1527557

7.1 Alternative approach for RUVm stage 1

If the number of samples in your experiment is greater than the number of Illumina negative controls on the array platform used - 613 for 450k, 411 for EPIC - stage 1 of RUVm will not work. In such cases, we recommend performing a standard limma analysis in stage 1.

# setup design matrix
des <- model.matrix(~grp)
des