To install this package, start R (version “4.3”) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("betaHMM")
DNA methylation, the addition of a methyl group to a cytosine-guanine dinucleotide (CpG) site, is influenced by environmental factors and serves as a disease biomarker. In diploid individuals, CpG sites are hypermethylated (both strands methylated), hypomethylated (neither strand methylated), or hemimethylated (one strand methylated). Identifying differentially methylated CpG sites (DMC) and regions (DMR) can reveal the impact of environmental stressors.
Methylation levels, called beta values, represent the proportion of methylated probes, usually modeled with beta distributions. They are often logit transformed into M-values for modeling. To directly model beta values and account for spatial correlation between CpG sites, we propose a homogeneous hidden Markov model (HMM) approach called betaHMM. It identifies DMCs and DMRs while considering spatial dependency and sample relationships. Simulation and prostate cancer data demonstrate its effectiveness. Through our submission of betaHMM to Bioconductor, we intend to contribute to the open-source software ecosystem, supporting robust and reproducible data analysis for both established and emerging biological assays.
This document gives a quick tour of the functionalities in betaHMM. See
help(package="betaHMM")
for further details and references provided by
citation("betaHMM")
.
Before starting the betaHMM walk through, the user should have a working R software environment installed on their machine. The betaHMM package has the following dependencies which, if not already installed on the machine will automatically be installed along with the package: stats, ggplot2, utils, scales, methods, pROC, foreach, doParallel, cowplot, dplyr, tidyr, stringr.
Assuming that the user has the betaHMM package installed, the user first needs to load the package:
library(betaHMM)
The betaHMM software package offers a pre-processed methylation dataset containing beta values obtained from DNA samples collected from four patients with high-grade prostate cancer. These samples encompass both benign and tumor prostate tissues and were subjected to methylation profiling using Infinium MethylationEPIC Beadchip technology. The dataset comprises DNA samples from R = 2 treatment conditions for each of N = 4 patients, with each DNA sample providing beta values for C = 694,820 CpG sites. This data collection was part of a study on prostate cancer methylomics (Silva et al. 2020). For testing purposes, a subset of CpG sites from chromosome 7 has been included in the package.
This package also provides a subset of the EPIC annotation file, which users need to input into the betaHMM function. Users can load these two datasets from the package and inspect the first six rows in the dataframes using the following procedure:
data(pca_methylation_data)
head(pca_methylation_data)
#> IlmnID FFPE_benign_1 FFPE_benign_2 FFPE_benign_3 FFPE_benign_4
#> 271510 cg00725145 0.5942203 0.6258024 0.5058501 0.5520584
#> 271511 cg05107246 0.8433906 0.8475270 0.6314132 0.7155429
#> 271512 cg15535638 0.3437175 0.4081485 0.3283133 0.3578261
#> 271513 cg20044143 0.5863234 0.6594629 0.5081511 0.6201563
#> 271514 cg07921164 0.8811453 0.8876218 0.8556576 0.8807811
#> 271515 cg07972624 0.6200234 0.6917817 0.8356504 0.8365774
#> FFPE_tumour_1 FFPE_tumour_2 FFPE_tumour_3 FFPE_tumour_4
#> 271510 0.4602464 0.5864220 0.5699625 0.5628296
#> 271511 0.7354857 0.6965985 0.7902308 0.7044976
#> 271512 0.3874172 0.5715608 0.4915726 0.4126223
#> 271513 0.4197936 0.6960895 0.6438289 0.7456794
#> 271514 0.8342399 0.8981871 0.8781351 0.7571686
#> 271515 0.8512817 0.8675258 0.8292696 0.6937511
data(annotation_data)
head(annotation_data)
#> IlmnID Genome_Build CHR MAPINFO UCSC_RefGene_Name
#> 54 cg17236668 37 7 933891 C7orf20
#> 88 cg02552646 37 7 65953949
#> 93 cg12409226 37 7 100076985 TSC22D4
#> 100 cg00376553 37 7 100073281 TSC22D4
#> 121 cg00039070 37 7 133167057 EXOC4;EXOC4
#> 127 cg13643060 37 7 95765717 SLC25A13;SLC25A13;SLC25A13
#> UCSC_RefGene_Accession UCSC_RefGene_Group
#> 54 NM_015949 Body
#> 88
#> 93 NM_030935 TSS200
#> 100 NM_030935 Body
#> 121 NM_021807;NM_001037126 Body;Body
#> 127 NM_014251;NM_001160210;NR_027662 Body;Body;Body
#> UCSC_CpG_Islands_Name Relation_to_UCSC_CpG_Island
#> 54 chr7:935031-935648 N_Shore
#> 88
#> 93 chr7:100075303-100075551 S_Shore
#> 100 chr7:100075303-100075551 N_Shelf
#> 121
#> 127
The initial phase of the workflow involves employing the Baum-Welch algorithm to estimate the model parameters. Subsequently, we apply the Viterbi algorithm to determine the most probable sequence of hidden states.
M <- 3 ## No. of methylation states in a DNA sample type
N <- 4 ## No. of patients
R <- 2 ## No. of treatment conditions
my.seed <- 321 ## set seed for reproducibility
betaHMM_out <- betaHMM(pca_methylation_data,
annotation_data,
M = 3,
N = 4,
R = 2,
parallel_process = FALSE,
seed = my.seed,
treatment_group = c("Benign","Tumour"))
The resulting output of a call to betaHMM is an S4 object of class betaHMMResults.
class(betaHMM_out)
#> [1] "betaHMMResults"
#> attr(,"package")
#> [1] "betaHMM"
The parameters estimated can be displayed using the following S4 methods:
## transition matrix estimated for all chromosomes
A(betaHMM_out)
#> $`chr 7`
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.563464443 0.30159595 0.02171567 0.05775670 0.01719193 0.003560975
#> [2,] 0.309176595 0.37144773 0.04563686 0.15721470 0.03832432 0.013790827
#> [3,] 0.014678238 0.02210834 0.27914989 0.07875379 0.16946328 0.154331183
#> [4,] 0.060779770 0.13012916 0.11362618 0.38853200 0.06917960 0.043989810
#> [5,] 0.009254578 0.01275665 0.12016980 0.03810062 0.31401902 0.020499434
#> [6,] 0.006178354 0.01753161 0.25864344 0.05347247 0.04454113 0.441342903
#> [7,] 0.008642219 0.02109744 0.03150352 0.13309251 0.22205179 0.007581204
#> [8,] 0.007778939 0.01579866 0.10164230 0.03379551 0.23464662 0.039059438
#> [9,] 0.004280800 0.01208498 0.06853067 0.02232936 0.18650997 0.031000600
#> [,7] [,8] [,9]
#> [1,] 0.004794773 0.01769074 0.01222881
#> [2,] 0.010722351 0.03687122 0.01681540
#> [3,] 0.012067699 0.17800292 0.09144466
#> [4,] 0.072879893 0.07716979 0.04371381
#> [5,] 0.051637094 0.26953129 0.16403151
#> [6,] 0.001813682 0.10854337 0.06793305
#> [7,] 0.420786918 0.10132311 0.05392130
#> [8,] 0.021503796 0.31350795 0.23226680
#> [9,] 0.016214745 0.30900249 0.35004638
## Shape parameters estimated for a certain chromosome
phi(betaHMM_out)
#> $sp_1
#> $sp_1$`chr 7`
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 4.655330 3.637389 6.057968 2.130392 8.756417 9.171012 2.976585 21.62786
#> [2,] 5.291346 4.052073 4.326648 2.063767 17.471114 4.574163 6.667445 43.22361
#> [,9]
#> [1,] 66.98608
#> [2,] 95.94335
#>
#>
#> $sp_2
#> $sp_2$`chr 7`
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 105.2895 38.06930 5.091516 8.264149 3.766216 2.215542 5.996411 3.461735
#> [2,] 131.6684 43.86579 4.208591 5.566511 4.864919 1.942536 4.331857 5.290168
#> [,9]
#> [1,] 4.524619
#> [2,] 5.633629
## Hidden states assigned to all CpG sites for a certain chromosome
head(hidden_states(betaHMM_out)[["chr 7"]])
#> [1] 5 5 3 3 8 5
A summary of the model parameters estimated for each chromosome can be obtained as below:
summary(betaHMM_out)
#> *************************************************
#> betaHMM workflow function: Parameter estimation
#> Number of hidden states estimated:9
#> No. of total CpG sites:38672
#> Number of treatment conditions :2
#> No. of patients in each treatment group:4,4
#> *************************************************
#>
#> Summary of each chromosome analysed
#>
#> Chromosome Number log-likelihood
#> "7" "355810.1"
#> Prop. of CpG sites in each hidden state
#> "0.08,0.075,0.121,0.085,0.176,0.067,0.042,0.201,0.153"
After estimating the parameters and hidden states, we proceed to calculate the AUC metric, which quantifies the dissimilarities between the cumulative distributions estimated for each hidden state within each chromosome. A user-defined threshold for the AUC metric is then applied to identify the most differentially methylated hidden states. Additionally, we utilize a user-defined threshold for the measure of uncertainty regarding membership in these highly differentially methylated hidden states to pinpoint the most differentially methylated CpG sites.
To access the dataframe containing information about CpG site locations,
methylation values, hidden state assignments, and a flag indicating DMC status,
you can use the S4 assay
command.
dmc_out <- dmc_identification(betaHMM_out)
dmc_df <- assay(dmc_out)
head(dmc_df)
#> IlmnID CHR MAPINFO FFPE_benign_1 FFPE_benign_2 FFPE_benign_3
#> cg00725145 cg00725145 7 31006 0.5942203 0.6258024 0.5058501
#> cg05107246 cg05107246 7 31441 0.8433906 0.8475270 0.6314132
#> cg15535638 cg15535638 7 31467 0.3437175 0.4081485 0.3283133
#> cg20044143 cg20044143 7 31497 0.5863234 0.6594629 0.5081511
#> cg07921164 cg07921164 7 36011 0.8811453 0.8876218 0.8556576
#> cg07972624 cg07972624 7 41525 0.6200234 0.6917817 0.8356504
#> FFPE_benign_4 FFPE_tumour_1 FFPE_tumour_2 FFPE_tumour_3
#> cg00725145 0.5520584 0.4602464 0.5864220 0.5699625
#> cg05107246 0.7155429 0.7354857 0.6965985 0.7902308
#> cg15535638 0.3578261 0.3874172 0.5715608 0.4915726
#> cg20044143 0.6201563 0.4197936 0.6960895 0.6438289
#> cg07921164 0.8807811 0.8342399 0.8981871 0.8781351
#> cg07972624 0.8365774 0.8512817 0.8675258 0.8292696
#> FFPE_tumour_4 hidden_state DMC
#> cg00725145 0.5628296 5 0
#> cg05107246 0.7044976 5 0
#> cg15535638 0.4126223 3 0
#> cg20044143 0.7456794 3 0
#> cg07921164 0.7571686 8 0
#> cg07972624 0.6937511 5 0
summary(dmc_out)
#> *************************************************
#> betaHMM workflow function: DMC identification
#> No. of total CpG sites:38672
#> Number of treatment conditions :2
#> No. of patients in each treatment group:4,4
#> Total number of DMCs:1231
#> *************************************************
#>
#>
#> No. of CpG sites and DMCs in each chromosome:
#> Chromosome Number No. of CpG sites No. of DMCs
#> "7" "38672" "1231"
The fitted density estimates, kernel density estimates, and the uncertainty in
the hidden state assignment can be observed using the plot function for the
betaHMM output. Since the parameters are estimated individually for each
chromosome, one can generate plots for each chromosome separately. To specify
the chromosome of interest, the user should utilize the chromosome
parameter
within the function.
Additionally, the AUC metrics calculated for the hidden states can also be
displayed using the AUC
parameter. By providing the AUC values obtained
through the dmc_identification
function as an input parameter, the plot will
depict the AUC metrics corresponding to the selected chromosome in the plot
panels.
AUC_chr <- AUC(dmc_out)
plot(betaHMM_out, chromosome = "7", what = "fitted density", AUC = AUC_chr)