Kendall, B., Jin, C., Li, K., Ruan, F., Wang, X.A., Wang, J.-P., DNAcycP2: improved estimation of intrinsic DNA cyclizability through data augmentation, Nucleic Acids Research, gkaf145, 2025.
Introduction
DNAcycP2, short for DNA cyclizability Prediction v2, is an
R package (Python version is also available) developed for precise and unbiased
prediction of DNA intrinsic cyclizability scores. This tool builds on a deep
learning framework that integrates Inception and Residual network architectures
with an LSTM layer, providing a robust and accurate prediction mechanism.
DNAcycP2 is an updated version of the earlier DNAcycP tool released by Li
et al. in 2021. While DNAcycP was trained on loop-seq data from Basu et al.
(2021), DNAcycP2 improves upon it by training on smoothed predictions derived
from this dataset. The predicted score, termed C-score, exhibits high
accuracy when compared with experimentally measured cyclizability scores
obtained from the loop-seq assay. This makes DNAcycP2 a valuable tool for
researchers studying DNA mechanics and structure.
Key differences between DNAcycP2 and DNAcycP
Following the release of DNAcycP, it was found that the intrinsic cyclizability
scores derived from Basu et al. (2021) retained residual bias from the biotin
effect, resulting in inaccuracies (Kendall et al., 2025). To address this, we
employed a data augmentation + moving average smoothing method to produce
unbiased estimates of intrinsic DNA cyclizability for each sequence in the
original training dataset. A new model, trained on this corrected data but
using the same architecture as DNAcycP, was developed, resulting in DNAcycP2.
This version also introduces improved computational efficiency through
parallelization options. Further details are available in Kendall et al. (2025).
To demonstrate the differences, we compared predictions from DNAcycP and
DNAcycP2 in a yeast genomic region at base-pair resolution (Figure 1). The
predicted biotin-dependent scores (\(\tilde C_{26}\), \(\tilde C_{29}\), and
$ C_{31}$, model trained separately) show 10-bp periodic oscillations
due to biotin biases, each with distinct phases. DNAcycP’s predictions improved
over the biotin-dependent scores, while still show substantial
local fluctuations likely caused by residual bias in the training data (the
called intrinsic cyclizability score \(\hat C_0\) from Basu et al. 2021). In
contrast, DNAcycP2, trained on corrected intrinsic cyclizability scores,
produces much smoother local-scale predictions, indicating a further
improvement in removing the biotin bias.
The DNAcycP2 package retains all prediction functions from the original
DNAcycP. The improved prediction model, based on smoothed data, can be accessed
using the argument smooth=TRUE in the main function (see usage below).
Usage
Main Functions
The DNAcycP2 R package provides two primary functions for cyclizability
prediction:
cycle
: Takes an R object (vector of strings) as input. Each element in
the vector is a DNA sequence.
cycle_fasta
: Takes the path of a fasta file as input.
Selecting the Prediction Model
Both functions use the smooth
argument to specify the prediction model:
smooth=TRUE
: DNAcycP2 (trained on smoothed data, recommended).
smooth=FALSE
: DNAcycP (trained on original data).
Parallelization with cycle_fasta
The cycle_fasta
function is designed for handling larger files and supports
parallelization. To enable parallelization, use the following arguments:
n_cores
: Number of cores to use (default: 1).
chunk_length
: Sequence length (in bp) each core processes at a time
(default: 100,000).
The cycle_fasta
function is designed for larger files, so it has added
parallelization capability. To utilize this capability, specify the number of
cores to be greater than 1 using the n_cores
argument (default 1). You can
also specify the length of the sequence that each core will predict on at a
given time using the chunk_length
argument (default 100000).
For reference, on a personal computer (16 Gb RAM, M1 chip with 8-core CPU),
prediction at full parallelization directly on the yeast genome FASTA file
completes in 12 minutes, and on the hg38 human genome Chromosome I FASTA file
in just over 4 hours. In our experience, selection of parallelization
parameters (n_cores
and chunk_length
) has little affect when making
predictions on a personal computer, but if using the package on a high-
performance compute cluster, prediction time should decrease as the number of
cores increases. If you do run into memory issues, we first suggest reducing
chunk_length
.
DNAcycP2 prediction – Normalized vs unnormalized
Both cycle_fasta
and cycle
output the prediction results in normalized
(C0_norm
,C0S_norm
) and unnomralized (C0_unnorm
,C0S_unnorm
) version.
In DNAcycP2, the predicted cyclizability always contains normalized and
unnormalized values. the unnormalized results were based on the model
trained on unnormalized \(\hat C_0\) or \(\hat C_0^s\) scores. In contrast, the
normalized results were predicted by the model trained on the normalized
\(\hat C_0\) or \(\hat C_0^s\) values.
The cyclizability score from different loop-seq libraries may be subject to a
systematic library-specific constant difference due to its definition (see
Basu et al 2021), and hence it’s a relative measure and not directly comparable
between libraries. The normalization will force the training data to have
mean = 0 and standard deviation = 1 such that the 50 bp sequences from yeast
genome roughly have mean = 0 and standard deviation = 1 for intrinsic
cyclizabilty score. Thus for any sequence under prediciton, the normalized
C-score can be more informative in terms of its cyclizabilty relative to the
population. For example, the C-score provides statisitcal significance
indicator, i.e. a C-score of 1.96 indicates 97.5% in the distribution.
Save DNAcycP2 prediciton to external file
Both cycle_fasta
and cycle
provides an argument save_path_prefix
to save
the prediction results onto local hard drive. For example:
ex2_smooth <- DNAcycP2::cycle(
ex2$V1,
smooth=TRUE,
save_path_prefix="ex2_smooth"
)
This will execute the same predictions as previously, and additionally save two
files named ‘ex2_smooth_C0S_norm.txt’ and ‘ex2_smooth_C0S_unnorm.txt’ to the
current working directory. The output files from cycle_fasta
have the same
format as the function output, but for consistency with the Python pacakge
it is important to note that the output files from cycle
have a different
format than the function output. Namely, rather than writing a single file
for every sequence, the function always writes two files (regardless of the
number of sequences), one containing normalized predictions for every sequence
(ending in ‘C0S_norm.txt’ or ‘C0_norm.txt’) and the other containing
unnormalized predictions for every sequence (ending in ‘C0S_unnorm.txt’ or
‘C0_unnorm.txt’). C-scores in each line correspond to the sequence from the
sequences
input in the same order.
For any input sequence, DNAcycP2 predicts the C-score for every 50 bp.
Regardless of the input sequence format the first C-score in the output file
corresponds to the sequence from position 1-50, second for 2-51 and so forth.
Example 3 (Single Sequence):
If you want the predict C-scores for a single sequence, you can follow
the same protocol as Example 1 or 2, depending on the input format. We
have included two example files representing the same 1000bp stretch of
S. Cerevisiae sacCer3 Chromosome I (1:1000) in .fasta and .txt format.
First, we will consider the .fasta format:
ex3_fasta_file <- system.file(
"extdata", "ex3_single_seq.fasta", package = "DNAcycP2"
)
ex3_fasta_smooth <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=TRUE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
ex3_fasta_original <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=FALSE)
#> Sequence length for ID 1: 1000
#> Predicting cyclizability...
#> Chunk size: 100000, num threads: 1
The output (ex3_fasta_smooth
or ex3_fasta_original
) is a list with
1 entry named “cycle_1”.
Let’s say we are interested only in the smooth (DNAcycP2), normalized
predictions for the subsequence defined by the first 100bp
(corresponding to subsequences defined by regions [1,50], [2,51],
…, and [51-100], or position
s 25, 26, …, and 75). We can
access the outputs for this subsequence using the following command:
ex3_fasta_smooth[[1]][1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794828
#> 2 26 0.88397902
#> 3 27 1.05313075
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68857002
#> 8 32 1.79380059
#> 9 33 1.78517485
#> 10 34 1.51873457
#> 11 35 1.83588874
#> 12 36 1.81811345
#> 13 37 1.89403594
#> 14 38 1.94740200
#> 15 39 1.67138946
#> 16 40 1.55992007
#> 17 41 1.66846406
#> 18 42 1.62409890
#> 19 43 1.54238129
#> 20 44 1.46230042
#> 21 45 1.41134250
#> 22 46 1.27757108
#> 23 47 1.02153277
#> 24 48 1.01221037
#> 25 49 0.93556404
#> 26 50 1.00608754
#> 27 51 1.09152973
#> 28 52 1.32879055
#> 29 53 1.25343049
#> 30 54 1.34268761
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469960
#> 35 59 0.96556181
#> 36 60 0.85989267
#> 37 61 0.91100568
#> 38 62 0.87353712
#> 39 63 0.55437237
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158629
#> 43 67 0.14434627
#> 44 68 0.08841624
#> 45 69 -0.09656694
#> 46 70 -0.19039956
#> 47 71 -0.34425002
#> 48 72 -0.41227332
#> 49 73 -0.38969830
#> 50 74 -0.33039862
#> 51 75 -0.44366446
Or, equivalently,
ex3_fasta_smooth$cycle_1[1:51,c("position", "C0S_norm")]
#> position C0S_norm
#> 1 25 0.94794828
#> 2 26 0.88397902
#> 3 27 1.05313075
#> 4 28 1.31837559
#> 5 29 1.43364513
#> 6 30 1.47490740
#> 7 31 1.68857002
#> 8 32 1.79380059
#> 9 33 1.78517485
#> 10 34 1.51873457
#> 11 35 1.83588874
#> 12 36 1.81811345
#> 13 37 1.89403594
#> 14 38 1.94740200
#> 15 39 1.67138946
#> 16 40 1.55992007
#> 17 41 1.66846406
#> 18 42 1.62409890
#> 19 43 1.54238129
#> 20 44 1.46230042
#> 21 45 1.41134250
#> 22 46 1.27757108
#> 23 47 1.02153277
#> 24 48 1.01221037
#> 25 49 0.93556404
#> 26 50 1.00608754
#> 27 51 1.09152973
#> 28 52 1.32879055
#> 29 53 1.25343049
#> 30 54 1.34268761
#> 31 55 1.13988316
#> 32 56 1.00737357
#> 33 57 1.13837850
#> 34 58 1.11469960
#> 35 59 0.96556181
#> 36 60 0.85989267
#> 37 61 0.91100568
#> 38 62 0.87353712
#> 39 63 0.55437237
#> 40 64 0.60366714
#> 41 65 0.41495007
#> 42 66 0.22158629
#> 43 67 0.14434627
#> 44 68 0.08841624
#> 45 69 -0.09656694
#> 46 70 -0.19039956
#> 47 71 -0.34425002
#> 48 72 -0.41227332
#> 49 73 -0.38969830
#> 50 74 -0.33039862
#> 51 75 -0.44366446
Next, we will consider the .txt format:
ex3_txt_file <- system.file(
"extdata",
"ex3_single_seq.txt",
package = "DNAcycP2"
)
ex3_txt <- read.csv(ex3_txt_file, header = FALSE)
ex3_txt_smooth <- DNAcycP2::cycle(ex3_txt$V1, smooth=TRUE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
ex3_txt_original <- DNAcycP2::cycle(ex3_txt$V1, smooth=FALSE)
#> Reading sequences...
#> Not all sequences are length 50, predicting every subsequence...
The output (ex3_txt_smooth
or ex3_txt_original
) is a list with 1 entry
(unnamed).
Note, that ex3_fasta_smooth
and ex3_txt_smooth
are essentially equivalent.
The only exceptions are perhaps slight rounding differences that come from the
computation, and that the list ex3_fasta_smooth
has named entries (‘cycle_1’)
while ex3_txt_smooth
does not. The same applies for ex3_fasta_original
and
ex3_txt_original
.
Therefore, we can use a similar command to access the outputs for our
subsequence of interest:
ex3_txt_smooth[[1]][1:51,c("position", "C0S_norm")]
If there is a sequence (or group of sequences) we want to make predictions on,
we can also input them directly as strings. For example:
input_seq1 =
"CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGTAGCTATATCGAGTCTAGCGTCTAG"
input_seq2 = "ATCTTTTGTATATCAAAAGACTAGATCGATTAGCGTACGCCCCTGACTAGATAGATCG"
seq1_smooth = DNAcycP2::cycle(c(input_seq1), smooth=TRUE)
both_seqs_smooth = DNAcycP2::cycle(c(input_seq1, input_seq2), smooth=TRUE)