The gDNAx
package provides functionality to diagnose the presence of genomic DNA (gDNA) contamination in RNA-seq data sets, and filter out reads of potential gDNA origin.
gDNAx 1.0.2
Genomic DNA (gDNA) contamination is an internal contaminant that can be present in gene quantification techniques, such as in an RNA-sequencing (RNA-seq) experiment. This contamination can be due to an absent or inefficient gDNA digestion step (with DNase) during the extraction of total RNA in the library preparation process. In fact, some protocols do not include a DNase treatment step, or they include it as optional.
While gDNA contamination is not a major issue for poly(A) RNA-seq, it can remarkably affect gene expression quantification of total RNA-seq experiments. Moreover, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the level of gDNA contamination in the quality control analysis before performing further analyses, specially when total RNA has been sequenced.
Here we illustrate the use of the gDNAx
package for calculating different
diagnostics related to gDNA contamination levels.
To do so, a subset of the data in (Li et al. 2022) is used. This data consists
in 9 paired-end samples of total RNA-seq with increasing levels of gDNA
contamination: 0% (no contamination), 1% and 10%, with 3 replicates each.
The data is available through the gDNAinRNAseqData
package. BAM files
contain about 100,000 alignments, sampled uniformly at random from complete
BAM files.
library(gDNAinRNAseqData)
# Retrieving BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/tmp/Rtmpf6TXkK/s32gDNA0.bam" "/tmp/Rtmpf6TXkK/s33gDNA0.bam"
[3] "/tmp/Rtmpf6TXkK/s34gDNA0.bam" "/tmp/Rtmpf6TXkK/s26gDNA1.bam"
[5] "/tmp/Rtmpf6TXkK/s27gDNA1.bam" "/tmp/Rtmpf6TXkK/s28gDNA1.bam"
[7] "/tmp/Rtmpf6TXkK/s23gDNA10.bam" "/tmp/Rtmpf6TXkK/s24gDNA10.bam"
[9] "/tmp/Rtmpf6TXkK/s25gDNA10.bam"
# Getting information about the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
gDNA
s32gDNA0 0
s33gDNA0 0
s34gDNA0 0
s26gDNA1 1
s27gDNA1 1
s28gDNA1 1
s23gDNA10 10
s24gDNA10 10
s25gDNA10 10
strandMode
The strandMode
of a sample depends on the library protocol used: it can be
strand-specific (stranded) or non strand-specific (non-stranded). Stranded
paired-end RNA-seq is, in turn, divided into: libraries were
the pair strand is that of the first alignment and libraries were the pair
strand is that of the second alignment.
Function identifyStrandMode()
can be used to try to identify the library
protocol used:
library(gDNAx)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
strandM <- identifyStrandMode(bamfiles, txdb, singleEnd=FALSE)
strandM$strandMode
[1] NA
The strandMode
identified is NA
, meaning that a non-stranded protocol
was used. See the “Details” section in the help page of identifyStrandMode()
for further details.
Let’s take a look at the data.frame
with all strandedness values. The
strandedness values are based on the proportion of reads aligning to the
same or opposite strand as transcripts in the annotations. See help page of
identifyStrandMode()
for more information.
strandM$Strandedness
strandMode1 strandMode2 ambig Nalignments
s32gDNA0 0.4661305 0.4646406 0.06922888 69133
s33gDNA0 0.4665387 0.4647487 0.06871265 69274
s34gDNA0 0.4706538 0.4619302 0.06741606 69123
s26gDNA1 0.4710407 0.4629427 0.06601662 66559
s27gDNA1 0.4727378 0.4595452 0.06771699 65567
s28gDNA1 0.4665966 0.4666584 0.06674493 64739
s23gDNA10 0.4674282 0.4642238 0.06834804 53052
s24gDNA10 0.4635957 0.4691156 0.06728863 51450
s25gDNA10 0.4666502 0.4655258 0.06782407 52474
As we can see, the proportion of alignments overlapping
transcripts when strandMode = 1L
is used (“strandMode1” column)
is very similar to the one when strandMode = 2L
is considered
(“strandMode2” column), which is compatible with a non-stranded library.
This contrasts with the reported stranded protocol used to obtain this data
according to (Li et al. 2022). However, the results obtained by
identifyStrandMode()
were contrasted with results of the RSeQC
infer_experiment.py
tool ((Wang, Wang, and Li 2012)) and visual inspection of data
in the Integrative Genomics Viewer (IGV) ((Robinson et al. 2011)), all
of which point to an unstranded RNA-seq experiment.
identifyStrandMode()
uses 200,000 alignments overlapping exonic regions to
compute strandedness (recommended by (Signal and Kahlke 2022)),
unless the number of these kind of alignments in the BAM file is lower.
In this vignette, the number of alignments used is close to 60,000, which is
the total number of exonic alignments present in the BAM files.
Once the strandMode
is known, we can calculate gDNA contamination diagnostics
using the gDNAdx()
function. A subset of the alignments present in the BAM
file are considered to obtain these diagnostics. The number of alignments used
is set by the yieldSize
parameter.
gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=strandM$strandMode)
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end (2x50nt)
# Strand mode: NA
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000
We, then, can get statistics on the origin of alignments and strandedness with
getDx()
:
dx <- getDx(gdnax)
dx
IGC INT SCJ SCE SCC IGCFLM SCJFLM SCEFLM
s32gDNA0 1.036524 11.75898 15.18563 40.04114 19.56452 178.939 157.224 158.402
s33gDNA0 1.125444 11.78607 15.21656 40.31236 19.55383 174.529 162.123 165.682
s34gDNA0 1.095440 12.26551 15.40036 40.19421 19.70788 171.118 156.341 155.538
s26gDNA1 1.402844 12.22650 14.78351 38.69202 18.73132 174.421 158.713 163.458
s27gDNA1 1.365428 12.45073 14.54513 38.21891 18.31765 173.409 166.383 164.063
s28gDNA1 1.503162 12.49083 14.09956 37.67457 17.96053 174.973 161.662 160.597
s23gDNA10 3.529043 13.16473 11.22648 30.99781 14.41110 174.421 161.371 164.748
s24gDNA10 3.781627 13.65285 10.85400 30.01465 13.91748 169.245 160.336 160.806
s25gDNA10 3.412804 13.51507 11.19751 30.65271 14.20691 174.168 160.922 158.964
INTFLM STRAND
s32gDNA0 154.940 NA
s33gDNA0 157.002 NA
s34gDNA0 152.265 NA
s26gDNA1 157.834 NA
s27gDNA1 158.440 NA
s28gDNA1 159.784 NA
s23gDNA10 162.035 NA
s24gDNA10 163.551 NA
s25gDNA10 157.776 NA
Next, we can plot the previous gDNA diagnostic measures using the default
plot()
function. This creates four plots, each one representing a diagnostic
measure as a function of the percentage of intergenic alignments, which can
be considered the most informative measure regarding gDNA contamination levels.
Here, strandedness values (STRAND column) are NA
since the dataset is not
strand-specific.
plot(gdnax, group=pdat$gDNA, pch=19)
Splice-compatible junction (SCJ) alignments (spliced alignments overlapping a transcript in a “splice compatible” way) and splice compatible exonic (SCE) alignments (alignments without a splice site, but that overlap a transcript in a “splice compatible” way) are expected to come from RNA sequenced reads. Instead, intergenic alignments (IGC) mainly come from DNA sequenced reads. For this reason, we see a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments: higher gDNA contamination levels lead to more IGC alignments and less SCJ or SCE alignments.
On the contrary, intronic (INT) alignments are positively correlated with IGC alignments and, thus, with gDNA contamination levels.
The last plot shows the strandedness value. In stranded RNA-seq protocols, we expect a strandedness value close to 1, meaning that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination: reads sequenced from DNA are expected to align in equal proportions to both strands. Therefore, a low strandedness in a stranded RNA-seq experiment can be due to the presence of DNA reads (contamination) mapping to transcripts but in the opposite strand.
Another plot to represent diagnostic measures is the one representing the origin of alignments per sample. Fluctuations in this proportions evidence different levels of gDNA contamination in samples.
plotAlnOrigins(gdnax, group=pdat$gDNA)