periodicDNA

Jacques Serizay

2020-10-27

Introduction to periodicDNA

Short DNA sequence motifs provide key information for interpreting the instructions in DNA, for example by providing binding sites for proteins or altering the structure of the double-helix. A less studied but important feature of DNA sequence motifs is their periodicity. A famous example is the 10-bp periodicity of many k-mers in nucleosome positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).

periodicDNA provides a framework to quantify the periodicity of k-mers of interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which periods are statistically enriched in a set of sequences, by using a randomized shuffling approach to compute an empirical p-value. It can also generate continuous linear tracks of k-mer periodicity strength over genomic loci.

Internal steps of periodicDNA

To estimate the periodicity strength of a given k-mer in one or several sequences, periodicDNA performs the following steps:

  1. The k-mer occurrences are mapped and their pairwise distances are calculated.
  2. The distribution of all the resulting pairwise distances (also called “distogram”) is generated.
  3. The distogram is transformed into a frequency histogram and smoothed using a moving window of 3 to mask the universal three-base genomic periodicity. To normalize the frequency for distance decay, the local average (obtained by averaging the frequency with a moving window of 10) is then subtracted from the smoothed frequency.
  4. Finally, the power spectral density (PSD) is estimated by applying a Fast Fourier Transform (Figure 1F) over the normalized frequency histogram. The magnitude of the PSD values indicates the contribution of a given period to the overall periodicity of the k-mer of interest.

Quantifying k-mer periodicity over a set of sequences

Basic usage

The main goal of periodicDNA is to quantify periodicity of a given k-mer in a set of sequences. For instance, one can assess the periodicity of TT dinucleotides in sequences around TSSs of ubiquitous promoters using getPeriodicity().

In the following example, getPeriodicity() is directly ran using a GRanges object, specifying from which genome this GRanges comes from.

The main output of getPeriodicity() is a table of power spectral density (PSD) values associated with discrete frequencies, computed using a Fast Fourier Transform. For a given frequency, a high PSD score indicates a high periodicity of the k-mer of interest.

In the previous example, TT dinucleotides in sequences around TSSs of ubiquitous promoters are highly periodic, with a periodicity of 10 bp.

Graphical output of getPeriodicity() can be obtained using the plotPeriodicityResults() function:

The first plot shows the raw distribution of distances between pairs of ‘TT’ in the sequences of the provided GRanges. The second plot shows the decay-normalised distribution. Finally, the third plot shows the PSD scores of the ‘TT’ k-mer, measured from the normalised distribution.

Repeated shuffling of input sequences

periodicDNA provides an approach to compare the periodicity of a given k-mer in a set of sequences compared to background. For a given k-mer at a period T in a set of input sequences, the fold-change over background of its PSD is estimated by iteratively shuffling the input sequences and estimating the resulting PSD values.
Eventually, the log2 fold-change (l2FC) between the observed PSD and the median of the PSD values measured after shuffling is computed as follows:

l2FC = log2(observed PSD / median(shuffled PSDs)).

If n_shuffling >= 100, an associated empirical p-value is calculated as well (North et al 2002). This metric indicates, for each individual period T, whether the observed PSD is significantly greater than the PSD values measured after shuffling the input sequences. Note that empirical p-values are only an estimation of the real p-value. Notably, small p-values are systematically under-estimated (North et al 2002).

Note

getPeriodicity() can also be ran directly on a set of sequences of interest as follows:

Track of periodicity over a set of Genomic Ranges

The other aim of periodicDNA is to generate continuous linear tracks of k-mer periodicity strength over genomic loci of interest. getPeriodicityTrack() can be used to generate suck genomic tracks. In the following example,

WW_10bp_track <- getPeriodicityTrack(
    genome = 'BSgenome.Celegans.UCSC.ce11',
    granges = ce11_proms, 
    motif = 'WW',
    period = 10,
    BPPARAM = setUpBPPARAM(1),
    bw_file = 'WW-10-bp-periodicity_over-proms.bw'
)

When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting track clearly shows increase of WW 10-bp periodicity above the ubiquitous and germline TSSs, whereas somatic TSSs do not show such increase.

data(ce11_TSSs)
plotAggregateCoverage(
    WW_10bp_track, 
    ce11_TSSs, 
    xlab = 'Distance from TSS',
    ylab = '10-bp periodicity strength (forward proms.)'
)

Session info

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] BSgenome.Celegans.UCSC.ce11_1.4.2 periodicDNA_1.0.0                
#>  [3] BiocParallel_1.24.0               BSgenome_1.58.0                  
#>  [5] rtracklayer_1.50.0                Biostrings_2.58.0                
#>  [7] XVector_0.30.0                    magrittr_1.5                     
#>  [9] ggplot2_3.3.2                     GenomicRanges_1.42.0             
#> [11] GenomeInfoDb_1.26.0               IRanges_2.24.0                   
#> [13] S4Vectors_0.28.0                  BiocGenerics_0.36.0              
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.20.0 zoo_1.8-8                  
#>  [3] tidyselect_1.1.0            xfun_0.18                  
#>  [5] purrr_0.3.4                 lattice_0.20-41            
#>  [7] colorspace_1.4-1            vctrs_0.3.4                
#>  [9] generics_0.0.2              htmltools_0.5.0            
#> [11] yaml_2.2.1                  XML_3.99-0.5               
#> [13] rlang_0.4.8                 pillar_1.4.6               
#> [15] glue_1.4.2                  withr_2.3.0                
#> [17] matrixStats_0.57.0          GenomeInfoDbData_1.2.4     
#> [19] lifecycle_0.2.0             stringr_1.4.0              
#> [21] zlibbioc_1.36.0             MatrixGenerics_1.2.0       
#> [23] munsell_0.5.0               gtable_0.3.0               
#> [25] evaluate_0.14               labeling_0.4.2             
#> [27] Biobase_2.50.0              knitr_1.30                 
#> [29] scales_1.1.1                DelayedArray_0.16.0        
#> [31] farver_2.0.3                Rsamtools_2.6.0            
#> [33] digest_0.6.27               stringi_1.5.3              
#> [35] dplyr_1.0.2                 grid_4.0.3                 
#> [37] cowplot_1.1.0               tools_4.0.3                
#> [39] bitops_1.0-6                RCurl_1.98-1.2             
#> [41] tibble_3.0.4                crayon_1.3.4               
#> [43] pkgconfig_2.0.3             ellipsis_0.3.1             
#> [45] Matrix_1.2-18               rmarkdown_2.5              
#> [47] R6_2.4.1                    GenomicAlignments_1.26.0   
#> [49] compiler_4.0.3