1 Introduction

The ENmix package provides a set of quality control, preprocessing/correction and data analysis tools for Illumina Methylation Beadchips. It includes functions to read in raw idat data, background correction, dye bias correction, probe-type bias adjustment, along with a number of additional tools. These functions can be used to remove unwanted experimental noise and thus to improve accuracy and reproducibility of methylation measures. ENmix functions are flexible and transparent. Users have option to choose a single pipeline command to finish all data pre-processing steps (including quality control, background correction, dye-bias adjustment, between-array normalization and probe-type bias correction) or to use individual functions sequentially to perform data pre-processing in a more customized manner. In addition the ENmix package has selectable complementary functions for efficient data visualization (such as QC plots, data distribution plot, manhattan plot and Q-Q plot), quality control (identifing and filtering low quality data points, samples, probes, and outliers, along with imputation of missing values), identification of probes with multimodal distributions due to SNPs or other factors, exploration of data variance structure using principal component regression analysis plot, preparation of experimental factors related surrogate control variables to be adjusted in downstream statistical analysis, an efficient algorithm oxBS-MLE to estimate 5-methylcytosine and 5-hydroxymethylcytosine level; estimation of celltype proporitons; methlation age calculation and differentially methylated region (DMR) analysis.

Most ENmix package can also support the data structure used by several other related R packages, such as minfi, wateRmelon and ChAMP, providing straightforward integration of ENmix-corrected datasets for subsequent data analysis.

ENmix readidat function does not depend on array annotation R packages. It can directly read in Illuminal manifest file, which makes it easier to work with newer array, such as MethylationEPICv2.0 and mouse Beadchip.

The software is designed to support large scale data analysis, and provides multi-processor parallel computing options for most functions.

2 List of functions

Data acquisition

  • readidat(): Read idat files into R
  • readmanifest(): Read array manifest file into R

Quality control

  • QCinfo(): Extract and visualize QC information
  • plotCtrl(): Generate internal control plots
  • getCGinfo(): Extract CpG probe annotation information
  • calcdetP(): Compute detection P values
  • qcfilter(): Remove low quality values, samples or CpGs; remove outlier samples and perform imputation
  • nmode(): Identify “gap” probes, i.e. those with multimodal distribution from underlying caused by underlying SNPs
  • dupicc(): Calculate Introclass correlation coefficient (ICC) using data for duplicates
  • freqpoly(): Frequency polygon plot for single variable
  • multifreqpoly(): Frequency polygon plot for multiple variables

Preprocessing

  • mpreprocess(): Preprocessing pipeline
  • preprocessENmix(): ENmix background correction and dye bias correction
  • relic(): RELIC dye bias correction
  • norm.quantile(): Quantile normalization
  • rcp(): RCP probe design type bias correction

Differential methylated region (DMR) analysis

  • ipdmr(): ipDMR differentially methylated region analysis
  • combp(): Combp differentially methylated region analysis

Other functions

  • oxBS.MLE(): MLE estimates of 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC)
  • estimateCellProp(): Estimate white blood cell type proportions
  • methyAge(): Calculate methylation age
  • predSex(): Estimate sample sex
  • ctrlsva(): Derive surrogate variables to control for experimental confounding using non-negative internal control probes
  • pcrplot(): Principal component regression plot
  • mhtplot(): P value manhattan plot
  • p.qqplot(): P value Q-Q plot
  • B2M(): Convert Beta value to M value
  • M2B(): Convert M value to Beta value

3 ENmix classes

ENmix organizes data with two different classes.

rgDataSet contains raw data (including internal control probes) from IDAT file, CpG annotation from Illumina manifest file and/or sample inforamtion (plate, array, and phenotypes) provided by users. Array intensity data is organized by probe (not CpG locus) at red and green channel.

methDataSet contains methylated and unmethylated intensity values (organized by CpG), CpG annotation from Illumina manifest file and/or sample inforamtion (plate, array, and phenotypes) provided by users.

The ENmix class hierarchy

4 Example Analysis

The following examples are brief demonstrations on how to perform DNA methylation analysis using ENmix functions.

4.1 Example 1: using pipeline

The pipeline function mpreprocess() can be used to perform quality control and data preprocessing.

rgSet <- readidat(datapath)
beta=mpreprocess(rgSet)

These can achieve the following tasks:

  • Read in raw idat files
  • Data preprocessing
    • background correction
    • dye-bias correction
    • between-array normalization
    • probe-type bias adjustment
  • Quality control
    • Identify and exclude low quality probes
    • Identify and exclude low quality samples
    • Identify and exclude outlier samples
    • Remove low quality and outlier data points
    • Imputation to replace missing values
  • Visualization
    • Data distribution plots before and after low quality sample removal
    • Plots of quality controls measures (detection P values, bisulfite control intensities) to guide selection of thresholds

The following is an executable example.

library(ENmix)
#read in data
require(minfiData)
#read in IDAT files
path <- file.path(find.package("minfiData"),"extdata")
rgSet <- readidat(path = path,recursive = TRUE)
#data pre-processing
beta=mpreprocess(rgSet,nCores=6)
#quality control, data pre-processing and imputation
beta=mpreprocess(rgSet,nCores=6,qc=TRUE,fqcfilter=TRUE,
     rmcr=TRUE,impute=TRUE)

#CpG information (from Illumina manifest file) is self-contained in rgDataSet or methDataSet
cginfo=getCGinfo(rgSet)
#It has the same infomation if extracted from methDataSet
meth=getmeth(rgSet)
cginfo1=rowData(meth)
#Probe information for internal controls
ictrl=getCGinfo(rgSet,type="ctrl")

4.2 Example 2: using individual function

The example code below is basically doing the same thing as in example 1, but has more customized options.

library(ENmix)
#read in data
path <- file.path(find.package("minfiData"),"extdata")
rgSet <- readidat(path = path,recursive = TRUE)
#QC info
qc<-QCinfo(rgSet)
#background correction and dye bias correction
#if qc info object is provided, the low quality or outlier samples
# and probes will be excluded before background correction
mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC", 
                     QCinfo=qc, nCores=6)
#between-array normalization
mdat<-norm.quantile(mdat, method="quantile1")
#probe-type bias adjustment
beta<-rcp(mdat,qcscore=qc)
#set low quality and outlier data points as NA (missing value)
#remove samples and probes with too many (eg. >5%) missing data
#perform imputation to replace missing values
beta <- qcfilter(beta,qcscore=qc,rmcr=TRUE,impute=TRUE)

4.3 Example 3: A more comprehensive example

This example is a brief demonstration of major functions in ENmix. Getting detailed information about data quality, SNP-like/gap probes, excluding user-specified probes and/or samples, principal component regression analysis, low quality or outlier sample or probe removal, sex prediction, cell type proportion estimation, methylation age estimation, differentially methylated region analysis and surrogate variables for batch effect, etc.

library(ENmix)
#read in data
path <- file.path(find.package("minfiData"),"extdata")
rgSet <- readidat(path = path,recursive = TRUE)

#attach some phenotype info for later use
sheet <- read.metharray.sheet(file.path(find.package("minfiData"),
         "extdata"),pattern = "csv$")
rownames(sheet)=paste(sheet$Slide,sheet$Array,sep="_")
colData(rgSet)=as(sheet[colnames(rgSet),],"DataFrame")

#generate internal control plots to inspect quality of experiments
plotCtrl(rgSet)

#generate quality control (QC) information and plots, 
#identify outlier samples in data distribution
#if set detPtype="negative", recommend to set depPthre to <= 10E-6
#if set detPtype="oob", depPthre of 0.01 or 0.05 may be sufficient
#see Heiss et al. PMID: 30678737 for details
qc<-QCinfo(rgSet,detPtype="negative",detPthre=0.000001)

###data preprocessing
#background correction and dye bias correction
mdat<-preprocessENmix(rgSet, bgParaEst="oob", dyeCorr="RELIC",
                      QCinfo=qc, nCores=6)
#between-array normalization
mdat<-norm.quantile(mdat, method="quantile1")
#attach phenotype data for later use
colData(mdat)=as(sheet[colnames(mdat),],"DataFrame")
#probe-type bias adjustment
beta<-rcp(mdat,qcscore=qc)

#Search for multimodal CpGs, so called gap probes
#beta matrix before qcfilter() step should be used for this purpose
nmode<-nmode(beta, minN = 3, modedist=0.2, nCores = 6)

#filter low quality and outlier data points for each probe
#rows and columns with too many missing values will be removed if specify
#Imputation will be performed to fill missing data if specify.
beta <- qcfilter(beta,qcscore=qc,rmcr=TRUE, rthre=0.05,
                   cthre=0.05,impute=TRUE)

#If for epigenetic mutation analysis, outliers should be kept 
beta <- qcfilter(beta,qcscore=qc,rmoutlier=FALSE,rmcr=TRUE, rthre=0.05,
#                   cthre=0.05,impute=FALSE)

##Principal component regression analysis plot to check data variance structure
#phenotypes to be studied in the plot
cov<-data.frame(group=colData(mdat)$Sample_Group,
    slide=factor(colData(mdat)$Slide))
#missing data is not allowed in the analysis!
pcrplot(beta, cov, npc=6)

#Non-negative control surrogate variables, which can be used in
# downstream association analysis to control for batch effects. 
csva<-ctrlsva(rgSet)

#estimate cell type proportions
#rgDataset or methDataSet can also be used for the estimation
celltype=estimateCellProp(userdata=beta,
         refdata="FlowSorted.Blood.450k",
         nonnegative = TRUE,normalize=TRUE)

#predic sex based on rgDataSet or methDataset
sex=predSex(rgSet)
sex=predSex(mdat)

#Methylation age calculation
mage=methyAge(beta)

#ICC analysis can be performed to study measurement reliability if
# have duplicates, see manual
#dupicc()

#After association analysis, the P values can be used for DMR analysis
#simulate a small example file in BED format
dat=simubed()
#using ipdmr method, low seed value (0.01 or 0.05) should be used in real study.
ipdmr(data=dat,seed=0.1)
#using comb-p method
combp(data=dat,seed=0.1)

5 Setting up the data

The first step is to import array raw data files (*.idat) to create an object of rgDataSet, which is similar to RGChannelSetExtended in minfi package, but with probe annotation integrated, and it has smaller data size (about 1/3 smaller). Alternatively, User can also use minfi package to read in idat files and create RGChannelSetExtended. Most functions in ENmix support RGChannelSetExtended and RGChannelSet.

Some array types and corresponding manifestfiles can be guessed by the program based on the number of probes per array. However, we recommend users provide the correct manifest file (.csv format) directly for their types of array, which can be easily downloaded from Illumina website, see below for some examples. This option also allows the function to read in data from newer array, such as mouse array.

library(ENmix)
rgSet <- readidat(path = "directory path for idat files", 
                 recursive = TRUE)

#Download manifestfile for HumanMethylation450 beadchip
system("wget https://webdata.illumina.com/downloads/productfiles/humanmethylation450/humanmethylation450_15017482_v1-2.csv")
mf="HumanMethylation450_15017482_v1-2.csv"
rgSet <- readidat(path = "path to idat files",manifestfile=mf,recursive = TRUE)

When methylation IDAT raw data files are not available, such as for some publicly available datasets, users can use methylated (M) and unmethylated (U) intensity data to create an object of MethylSet. MethylSet is also supported by most functions in ENmix.

M<-matrix_for_methylated_intensity
U<-matrix_for_unmethylated_intensity
pheno<-as.data.frame(cbind(colnames(M), colnames(M)))
names(pheno)<-c("Basename","filenames")
rownames(pheno)<-pheno$Basename
pheno<-AnnotatedDataFrame(data=pheno)
anno<-c("IlluminaHumanMethylation450k", "ilmn12.hg19")
names(anno)<-c("array", "annotation")
mdat<-MethylSet(Meth = M, Unmeth = U, annotation=anno, 
phenoData=pheno)

6 Quality Control

6.1 Internal control probes

Both Illumina 450k and EPIC array incorporate 15 different types of internal control probes (total of 848 probes for 450K and 635 probes for EPIC). The control plots from ENmix function plotCtrl() are similar to the control plots generated
by Illumina GenomeStudio software. See Illumina Infinium HD Methylation Assay for detailed description on how to interpret these control figures.

plotCtrl(rgSet)

Below is a list of internal control types and number of probes for each type.

Control types 450K EPIC v1 EPIC v2
Sample-Independent Controls
STAINING 4 6 4
EXTENSION 4 4 4
HYBRIDIZATION 3 3 3
TARGET REMOVAL 2 2 2
RESTORATION 1 1 1
Sample-Dependent Controls
BISULFITE CONVERSION I 12 10 10
BISULFITE CONVERSION II 4 4 4
SPECIFICITY I 12 12 12
SPECIFICITY II 3 3 3
NON-POLYMORPHIC 4 9 9
NORM_A 32 27 27
NORM_C 61 58 58
NORM_G 32 27 27
NORM_T 61 58 58
NEGATIVE 613 411 411