scmeth Vignette

4.2 Functions

This section describes the main functions and demonstrates their use on a sample data set that ships with the package.

The scmeth package contains several functions to assess different quality metrics and the success of the sequencing process.

coverage

One main metric is CpG coverage, which can be assessed in several ways. The most basic is to check how many CpGs were observed in each sample. The coverage function provides this information.

Loading the data

directory <- system.file("extdata", "bismark_data", package='scmeth')
bsObject <- HDF5Array::loadHDF5SummarizedExperiment(directory)

scmeth::coverage(bsObject)

## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGAAGG_report.txt 
##                                               756 
## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGGATG_report.txt 
##                                               424 
## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_ATCAAG_report.txt 
##                                               333
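As a rough illustration of what this count represents, the sketch below builds a small toy coverage matrix (rows are CpG sites, columns are cells) and counts the sites with at least one read per cell. The matrix values and the names covMatrix and observedCpgs are made up for the example; this is a base R analogy, not the scmeth implementation.

```r
# Toy read-coverage matrix: rows are CpG sites, columns are cells.
# All values and names are illustrative, not taken from the package data.
covMatrix <- matrix(
    c(2, 0, 1,
      0, 0, 3,
      1, 1, 1,
      0, 4, 1),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("cpg", 1:4), c("cellA", "cellB", "cellC"))
)

# A CpG counts as observed in a cell when it has at least one read.
observedCpgs <- colSums(covMatrix > 0)
observedCpgs
```

Counting sites rather than summing reads keeps the metric independent of how deeply each individual CpG was sequenced.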

readmetrics

Read information is important for assessing whether sequencing and alignment succeeded. The readmetrics function outputs a visualization showing the number of reads seen in each sample and the proportion of those reads that were mapped to the reference genome.

readmetrics(bsObject)

##                                              sample   total mapped unmapped
## 1 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGAAGG_report.txt 1145278 974438   170840
## 2 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGGATG_report.txt  763055 633756   129299
## 3 2017-08-02_HL2F5BBXX_2_AAGAGGCA_ATCAAG_report.txt  847927 717059   130868
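The mapped proportion can be derived directly from these counts. The sketch below re-enters the totals and mapped counts from the output above into a plain data frame (the shortened sample labels and the column name mappedFraction are choices made for this example) and computes the unmapped count and the mapped fraction per sample.

```r
# Read counts copied from the readmetrics output above.
readCounts <- data.frame(
    sample = c("AGAAGG", "AGGATG", "ATCAAG"),  # shortened labels for readability
    total  = c(1145278, 763055, 847927),
    mapped = c(974438, 633756, 717059)
)

# Unmapped reads are the remainder; the mapped fraction is mapped / total.
readCounts$unmapped <- readCounts$total - readCounts$mapped
readCounts$mappedFraction <- readCounts$mapped / readCounts$total
readCounts
```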

repMask

CpG islands are characterized by high GC content, a high observed-to-expected ratio of CpGs and a length over 500 bp. However, some repeat regions in the genome also fit these criteria although they are not bona fide CpG islands. It is therefore important to see how many CpGs are observed in the non-repeat regions of the genome. The repMask function provides the CpG coverage in non-repeat regions. To build the repeat-masked regions of the genome, repMask requires the organism and the genome build.

library(BSgenome.Mmusculus.UCSC.mm10)
load(system.file("extdata", 'bsObject.rda', package='scmeth'))
repMask(bs, Mmusculus, "mm10")

##                     coveredCpgs
## sc-RRBS_zyg_01_chr1       12208
## sc-RRBS_zyg_02_chr1        3056
## sc-RRBS_zyg_03_chr1        6666

Coverage by Chromosome

The number of CpGs captured can be visualized in several other ways. One is to observe how the CpGs are distributed across chromosomes. chromosomeCoverage outputs CpG coverage by individual chromosome. (Since the example data only contains chromosome 1, only the CpGs covered on chromosome 1 are shown.)

chromosomeCoverage(bsObject)

##   2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGAAGG_report.txt
## 1                                               756
##   2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGGATG_report.txt
## 1                                               424
##   2017-08-02_HL2F5BBXX_2_AAGAGGCA_ATCAAG_report.txt
## 1                                               333

featureCoverage

Another way to observe the distribution of CpGs is to classify them by the genomic features to which they belong. Some features are specific to CpG-dense regions, such as CpG islands, CpG shores and CpG shelves. Others are general genomic features such as introns, exons and promoters. This information can be obtained with the featureCoverage function. In addition to the bs object, this function requires the genomic features of interest and the genome build. Each element in the table represents the fraction of CpGs seen in a particular cell in a specific region, relative to all the CpGs seen in that region.

#library(annotatr)
featureList <- c('cpg_islands', 'cpg_shores', 'cpg_shelves')
DT::datatable(featureCoverage(bsObject, features=featureList, "hg38"))

cpgDensity

CpGs are not distributed uniformly across the genome. Most of the genome contains a very low percentage of CpGs, except for CpG-dense regions such as CpG islands. Bisulfite sequencing targets all the CpGs across the genome, whereas reduced representation bisulfite sequencing (RRBS) targets CpG-dense CpG islands. A CpG density plot is therefore a useful diagnostic of whether the protocol succeeded: it shows whether the protocol specifically targeted CpG-dense or CpG-sparse regions, or whether CpGs were captured uniformly across regions. Calculating CpG density requires a window length; by default the cpgDensity function uses 1 kb windows.

library(BSgenome.Hsapiens.NCBI.GRCh38)
DT::datatable(cpgDensity(bsObject, Hsapiens, windowLength=1000, small=TRUE))
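To make the idea of a windowed CpG count concrete, the following base R sketch counts CG dinucleotides per fixed-size window in a short made-up sequence. The sequence, the 10 bp window and the variable names are assumptions chosen to keep the example small; it illustrates the windowing idea only and is not how cpgDensity is implemented.

```r
# Toy DNA sequence: a CpG-dense stretch followed by a CpG-poor stretch.
# Illustrative only; real analyses would use a BSgenome object.
dnaSeq <- paste0(strrep("ACGCGTCGAC", 3), strrep("ATTTAGGCAT", 3))

windowLength <- 10  # hypothetical window size, far smaller than the 1 kb default

# Start positions of every CG dinucleotide in the sequence.
cpgPositions <- as.integer(gregexpr("CG", dnaSeq, fixed = TRUE)[[1]])

# Assign each CpG to a window and count the CpGs per window.
windowIndex <- (cpgPositions - 1) %/% windowLength + 1
cpgPerWindow <- tabulate(windowIndex, nbins = nchar(dnaSeq) %/% windowLength)
cpgPerWindow
```

In this toy sequence the first three windows are CpG dense and the last three contain no CpGs, which is the kind of contrast a density diagnostic is meant to expose.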

downsample

In addition to CpG coverage, methylation data can be assessed via down sampling analysis, the methylation bias plot and the methylation distribution. Down sampling analysis assesses whether the sequencing process reached saturation in terms of CpG capture. To perform it, the covered CpGs are sampled via binomial random sampling with a given probability, and at each probability level the number of CpGs captured is counted. If the number of CpGs captured reaches a plateau, the sequencing reached saturation and can be considered successful. The downsample function provides a matrix of CpG coverage for each sample at various down sampling rates, and the report renders this information as a plot. By default the down sampling rates range from 0.01 to 1, but users can change them.

DT::datatable(scmeth::downsample(bsObject))
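The binomial sampling step can be sketched in base R as follows. The simulated coverage matrix, the particular set of rates and all variable names are assumptions for illustration; the sketch shows binomial thinning of read counts in general and is not the code used inside downsample.

```r
set.seed(1)  # make the random thinning reproducible

# Toy coverage matrix: 200 CpG sites by 3 cells, Poisson-distributed read counts.
# Purely illustrative; real data would come from the bs object.
covMatrix <- matrix(rpois(600, lambda = 2), nrow = 200,
                    dimnames = list(NULL, c("cellA", "cellB", "cellC")))

rates <- c(0.01, 0.1, 0.25, 0.5, 0.75, 1)  # hypothetical down sampling rates

# For each rate, keep every read independently with that probability
# (binomial thinning) and count the CpGs that still have at least one read.
downsampled <- t(sapply(rates, function(p) {
    thinned <- matrix(rbinom(length(covMatrix), size = covMatrix, prob = p),
                      nrow = nrow(covMatrix))
    colSums(thinned > 0)
}))
dimnames(downsampled) <- list(rate = as.character(rates),
                              cell = colnames(covMatrix))
downsampled
```

Plotting each column of the resulting matrix against the rate gives the saturation curve: a curve that flattens before rate 1 suggests that additional reads would uncover few new CpGs.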

mbiasplot

The methylation bias plot shows methylation along the read positions. In high quality samples, methylation across the read is more or less a horizontal line. However, there can be fluctuations at the beginning or the end of the read due to base quality. Single cell sequencing samples can also show a jagged trend in the methylation bias plot because of low read counts. Methylation bias can be assessed with the mbiasplot function, which takes the M-bias file generated by the FireCloud pipeline and generates the methylation bias plot.

methylationBiasFile <- '2017-04-21_HG23KBCXY_2_AGGCAGAA_TATCTC_pe.M-bias.txt'
mbiasList <- mbiasplot(mbiasFiles=system.file("extdata", methylationBiasFile,
                                         package='scmeth'))

# Combine the list returned by mbiasplot into a single data frame
mbiasDf <- do.call(rbind.data.frame, mbiasList)

# Mean, standard deviation and standard error of methylation
# for each read position and read
meanTable <- stats::aggregate(methylation ~ position + read, data=mbiasDf, FUN=mean)
sdTable <- stats::aggregate(methylation ~ position + read, data=mbiasDf, FUN=sd)
seTable <- stats::aggregate(methylation ~ position + read, data=mbiasDf,
                            FUN=function(x){sd(x)/sqrt(length(x))})
sum_mt <- data.frame('position'=meanTable$position, 'read'=meanTable$read,
                     'meth'=meanTable$methylation, 'sdMeth'=sdTable$methylation,
                     'seMeth'=seTable$methylation)

# Approximate 95% confidence interval around the mean (normal approximation)
sum_mt$upperCI <- sum_mt$meth + (1.96*sum_mt$seMeth)
sum_mt$lowerCI <- sum_mt$meth - (1.96*sum_mt$seMeth)
sum_mt$read_rep <- paste(sum_mt$read, sum_mt$position, sep="_")

g <- ggplot2::ggplot(sum_mt)
g <- g + ggplot2::geom_line(ggplot2::aes_string(x='position', y='meth',
                                                colour='read'))
g <- g + ggplot2::geom_ribbon(ggplot2::aes_string(ymin = 'lowerCI',
                        ymax = 'upperCI', x='position', fill = 'read'),
                        alpha=0.4)
g <- g + ggplot2::ylim(0,100) + ggplot2::ggtitle('Mbias Plot')
g <- g + ggplot2::ylab('methylation')
g

methylationDist

The methylationDist function provides the methylation distribution of the samples. In this visualization, methylation is divided into quantiles and cells are ordered from lowest to highest methylation. In single cell analysis, almost all CpGs fall in the highest or the lowest quantile, so the visualization shows whether there are cells with intermediate methylation. Ideally, in single cell data most methylation values should be either 1 or 0. A large number of intermediate values indicates there might be some error in sequencing.

methylationDist(bsObject)

## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGAAGG_report.txt 
##                                         0.7841647 
## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGGATG_report.txt 
##                                         0.6940448 
## 2017-08-02_HL2F5BBXX_2_AAGAGGCA_ATCAAG_report.txt 
##                                         0.8130466
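The expectation that single cell methylation values cluster at 0 and 1 can be checked with a few lines of base R. In the sketch below the methylation values are invented, and the five equal-width bins are an assumption made for simplicity; the binning used inside methylationDist may differ.

```r
# Hypothetical per-CpG methylation values (methylated reads / total reads)
# for one cell; illustrative numbers only.
methylation <- c(0, 0, 0.1, 0.5, 0.9, 1, 1, 1, 1, 0)

# Bin the values into five equal-width intervals on [0, 1].
bins <- cut(methylation, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)

# Proportion of CpGs in each bin; most mass should sit in the first and last bin.
binFractions <- prop.table(table(bins))
binFractions

# Fraction of CpGs with intermediate methylation (neither 0 nor 1).
intermediateFraction <- mean(methylation > 0 & methylation < 1)
intermediateFraction
```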

bsConversionPlot

Another important metric in methylation analysis is the bisulfite conversion rate: of all the cytosines in a non-CpG context, the fraction that were converted, i.e. read as unmethylated. Ideally this number should be 1 (100%), indicating that none of the non-CpG cytosines are called methylated. In real data this will not be the case, yet a bisulfite conversion rate below 95% indicates a problem with sample preparation. The bsConversionPlot function generates a plot showing this metric for each sample.

bsConversionPlot(bsObject)

##                                              sample       bsc
## 1 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGAAGG_report.txt 0.9940704
## 2 2017-08-02_HL2F5BBXX_2_AAGAGGCA_AGGATG_report.txt 0.9951570
## 3 2017-08-02_HL2F5BBXX_2_AAGAGGCA_ATCAAG_report.txt 0.9932266
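As a worked example of the arithmetic, the sketch below computes a conversion rate from hypothetical counts of non-CpG cytosine calls and flags samples below the 95% threshold mentioned above. The counts, the column names other than bsc, and the assumption that the rate equals converted calls divided by all non-CpG calls are illustrative; the package derives its values from the alignment reports.

```r
# Hypothetical non-CpG cytosine calls per cell (illustrative numbers only).
nonCpgCalls <- data.frame(
    sample       = c("cellA", "cellB"),
    methylated   = c(600, 4800),     # non-CpG cytosines still read as C
    unmethylated = c(99400, 75200)   # non-CpG cytosines converted and read as T
)

# Conversion rate: share of non-CpG cytosines that were converted.
nonCpgCalls$bsc <- nonCpgCalls$unmethylated /
    (nonCpgCalls$methylated + nonCpgCalls$unmethylated)

# Flag samples below the 95% threshold.
nonCpgCalls$lowConversion <- nonCpgCalls$bsc < 0.95
nonCpgCalls
```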

sessionInfo()

## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
##  [2] BSgenome.Mmusculus.UCSC.mm10_1.4.3    
##  [3] BSgenome_1.76.0                       
##  [4] rtracklayer_1.68.0                    
##  [5] BiocIO_1.18.0                         
##  [6] Biostrings_2.76.0                     
##  [7] XVector_0.48.0                        
##  [8] GenomicRanges_1.60.0                  
##  [9] GenomeInfoDb_1.44.0                   
## [10] IRanges_2.42.0                        
## [11] S4Vectors_0.46.0                      
## [12] BiocGenerics_0.54.0                   
## [13] generics_0.1.3                        
## [14] scmeth_1.28.0                         
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                   bitops_1.0-9               
##   [3] bsseq_1.44.0                permute_0.9-7              
##   [5] rlang_1.1.6                 magrittr_2.0.3             
##   [7] matrixStats_1.5.0           compiler_4.5.0             
##   [9] RSQLite_2.3.9               GenomicFeatures_1.60.0     
##  [11] DelayedMatrixStats_1.30.0   png_0.1-8                  
##  [13] vctrs_0.6.5                 reshape2_1.4.4             
##  [15] stringr_1.5.1               pkgconfig_2.0.3            
##  [17] crayon_1.5.3                fastmap_1.2.0              
##  [19] dbplyr_2.5.0                labeling_0.4.3             
##  [21] Rsamtools_2.24.0            rmarkdown_2.29             
##  [23] tzdb_0.5.0                  UCSC.utils_1.4.0           
##  [25] purrr_1.0.4                 bit_4.6.0                  
##  [27] xfun_0.52                   beachmat_2.24.0            
##  [29] cachem_1.1.0                jsonlite_2.0.0             
##  [31] blob_1.2.4                  rhdf5filters_1.20.0        
##  [33] DelayedArray_0.34.0         Rhdf5lib_1.30.0            
##  [35] BiocParallel_1.42.0         parallel_4.5.0             
##  [37] R6_2.6.1                    bslib_0.9.0                
##  [39] stringi_1.8.7               limma_3.64.0               
##  [41] jquerylib_0.1.4             Rcpp_1.0.14                
##  [43] SummarizedExperiment_1.38.0 knitr_1.50                 
##  [45] R.utils_2.13.0              readr_2.1.5                
##  [47] Matrix_1.7-3                tidyselect_1.2.1           
##  [49] abind_1.4-8                 yaml_2.3.10                
##  [51] codetools_0.2-20            curl_6.2.2                 
##  [53] lattice_0.22-7              tibble_3.2.1               
##  [55] regioneR_1.40.0             plyr_1.8.9                 
##  [57] withr_3.0.2                 Biobase_2.68.0             
##  [59] KEGGREST_1.48.0             evaluate_1.0.3             
##  [61] BiocFileCache_2.16.0        pillar_1.10.2              
##  [63] BiocManager_1.30.25         filelock_1.0.3             
##  [65] MatrixGenerics_1.20.0       DT_0.33                    
##  [67] vroom_1.6.5                 RCurl_1.98-1.17            
##  [69] BiocVersion_3.21.1          hms_1.1.3                  
##  [71] ggplot2_3.5.2               sparseMatrixStats_1.20.0   
##  [73] munsell_0.5.1               scales_1.3.0               
##  [75] gtools_3.9.5                glue_1.8.0                 
##  [77] tools_4.5.0                 AnnotationHub_3.16.0       
##  [79] data.table_1.17.0           locfit_1.5-9.12            
##  [81] GenomicAlignments_1.44.0    XML_3.99-0.18              
##  [83] rhdf5_2.52.0                grid_4.5.0                 
##  [85] crosstalk_1.2.1             AnnotationDbi_1.70.0       
##  [87] colorspace_2.1-1            GenomeInfoDbData_1.2.14    
##  [89] HDF5Array_1.36.0            restfulr_0.0.15            
##  [91] annotatr_1.34.0             cli_3.6.4                  
##  [93] rappdirs_0.3.3              S4Arrays_1.8.0             
##  [95] dplyr_1.1.4                 gtable_0.3.6               
##  [97] R.methodsS3_1.8.2           sass_0.4.10                
##  [99] digest_0.6.37               SparseArray_1.8.0          
## [101] farver_2.1.2                rjson_0.2.23               
## [103] htmlwidgets_1.6.4           R.oo_1.27.0                
## [105] memoise_2.0.1               htmltools_0.5.8.1          
## [107] lifecycle_1.0.4             h5mread_1.0.0              
## [109] httr_1.4.7                  mime_0.13                  
## [111] statmod_1.5.0               bit64_4.6.0-1

Divy S. Kangeyan

2025-04-15

Contents

1. Introduction

2. Installation and package loading

3. Input files

4. Usage

4.1 Report