1 What is genomic DNA contamination in a RNA-seq experiment

Genomic DNA (gDNA) contamination is an internal contaminant that can be present in gene quantification techniques, such as in an RNA-sequencing (RNA-seq) experiment. This contamination can be due to an absent or inefficient gDNA digestion step (with DNase) during the extraction of total RNA in the library preparation process. In fact, some protocols do not include a DNase treatment step, or they include it as optional.

While gDNA contamination is not a major issue for poly(A) RNA-seq, it can remarkably affect gene expression quantification of total RNA-seq experiments. Moreover, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the level of gDNA contamination in the quality control analysis before performing further analyses, specially when total RNA has been sequenced.

2 Diagnose the presence of gDNA contamination

Here we illustrate the use of the gDNAx package for calculating different diagnostics related to gDNA contamination levels.

To do so, a subset of the data in (Li et al. 2022) is used. This data consists in 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the gDNAinRNAseqData package. BAM files contain about 100,000 alignments, sampled uniformly at random from complete BAM files.

library(gDNAinRNAseqData)

# Retrieving BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/tmp/Rtmpf6TXkK/s32gDNA0.bam"  "/tmp/Rtmpf6TXkK/s33gDNA0.bam" 
[3] "/tmp/Rtmpf6TXkK/s34gDNA0.bam"  "/tmp/Rtmpf6TXkK/s26gDNA1.bam" 
[5] "/tmp/Rtmpf6TXkK/s27gDNA1.bam"  "/tmp/Rtmpf6TXkK/s28gDNA1.bam" 
[7] "/tmp/Rtmpf6TXkK/s23gDNA10.bam" "/tmp/Rtmpf6TXkK/s24gDNA10.bam"
[9] "/tmp/Rtmpf6TXkK/s25gDNA10.bam"

# Getting information about the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
          gDNA
s32gDNA0     0
s33gDNA0     0
s34gDNA0     0
s26gDNA1     1
s27gDNA1     1
s28gDNA1     1
s23gDNA10   10
s24gDNA10   10
s25gDNA10   10

2.1 Identifying strandMode

The strandMode of a sample depends on the library protocol used: it can be strand-specific (stranded) or non strand-specific (non-stranded). Stranded paired-end RNA-seq is, in turn, divided into: libraries were the pair strand is that of the first alignment and libraries were the pair strand is that of the second alignment.

Function identifyStrandMode() can be used to try to identify the library protocol used:

library(gDNAx)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

strandM <- identifyStrandMode(bamfiles, txdb, singleEnd=FALSE)
strandM$strandMode
[1] NA

The strandMode identified is NA, meaning that a non-stranded protocol was used. See the “Details” section in the help page of identifyStrandMode() for further details.

Let’s take a look at the data.frame with all strandedness values. The strandedness values are based on the proportion of reads aligning to the same or opposite strand as transcripts in the annotations. See help page of identifyStrandMode() for more information.

strandM$Strandedness
          strandMode1 strandMode2      ambig Nalignments
s32gDNA0    0.4661305   0.4646406 0.06922888       69133
s33gDNA0    0.4665387   0.4647487 0.06871265       69274
s34gDNA0    0.4706538   0.4619302 0.06741606       69123
s26gDNA1    0.4710407   0.4629427 0.06601662       66559
s27gDNA1    0.4727378   0.4595452 0.06771699       65567
s28gDNA1    0.4665966   0.4666584 0.06674493       64739
s23gDNA10   0.4674282   0.4642238 0.06834804       53052
s24gDNA10   0.4635957   0.4691156 0.06728863       51450
s25gDNA10   0.4666502   0.4655258 0.06782407       52474

As we can see, the proportion of alignments overlapping transcripts when strandMode = 1L is used (“strandMode1” column) is very similar to the one when strandMode = 2L is considered (“strandMode2” column), which is compatible with a non-stranded library. This contrasts with the reported stranded protocol used to obtain this data according to (Li et al. 2022). However, the results obtained by identifyStrandMode() were contrasted with results of the RSeQC infer_experiment.py tool ((Wang, Wang, and Li 2012)) and visual inspection of data in the Integrative Genomics Viewer (IGV) ((Robinson et al. 2011)), all of which point to an unstranded RNA-seq experiment.

identifyStrandMode() uses 200,000 alignments overlapping exonic regions to compute strandedness (recommended by (Signal and Kahlke 2022)), unless the number of these kind of alignments in the BAM file is lower. In this vignette, the number of alignments used is close to 60,000, which is the total number of exonic alignments present in the BAM files.

2.2 Calculating and plotting diagnostics

Once the strandMode is known, we can calculate gDNA contamination diagnostics using the gDNAdx() function. A subset of the alignments present in the BAM file are considered to obtain these diagnostics. The number of alignments used is set by the yieldSize parameter.

gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=strandM$strandMode)
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end (2x50nt)
# Strand mode: NA
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000

We, then, can get statistics on the origin of alignments and strandedness with getDx():

dx <- getDx(gdnax)
dx
               IGC      INT      SCJ      SCE      SCC  IGCFLM  SCJFLM  SCEFLM
s32gDNA0  1.036524 11.75898 15.18563 40.04114 19.56452 178.939 157.224 158.402
s33gDNA0  1.125444 11.78607 15.21656 40.31236 19.55383 174.529 162.123 165.682
s34gDNA0  1.095440 12.26551 15.40036 40.19421 19.70788 171.118 156.341 155.538
s26gDNA1  1.402844 12.22650 14.78351 38.69202 18.73132 174.421 158.713 163.458
s27gDNA1  1.365428 12.45073 14.54513 38.21891 18.31765 173.409 166.383 164.063
s28gDNA1  1.503162 12.49083 14.09956 37.67457 17.96053 174.973 161.662 160.597
s23gDNA10 3.529043 13.16473 11.22648 30.99781 14.41110 174.421 161.371 164.748
s24gDNA10 3.781627 13.65285 10.85400 30.01465 13.91748 169.245 160.336 160.806
s25gDNA10 3.412804 13.51507 11.19751 30.65271 14.20691 174.168 160.922 158.964
           INTFLM STRAND
s32gDNA0  154.940     NA
s33gDNA0  157.002     NA
s34gDNA0  152.265     NA
s26gDNA1  157.834     NA
s27gDNA1  158.440     NA
s28gDNA1  159.784     NA
s23gDNA10 162.035     NA
s24gDNA10 163.551     NA
s25gDNA10 157.776     NA

Next, we can plot the previous gDNA diagnostic measures using the default plot() function. This creates four plots, each one representing a diagnostic measure as a function of the percentage of intergenic alignments, which can be considered the most informative measure regarding gDNA contamination levels. Here, strandedness values (STRAND column) are NA since the dataset is not strand-specific.

plot(gdnax, group=pdat$gDNA, pch=19)
Default diagnostics.

Figure 1: Default diagnostics

Splice-compatible junction (SCJ) alignments (spliced alignments overlapping a transcript in a “splice compatible” way) and splice compatible exonic (SCE) alignments (alignments without a splice site, but that overlap a transcript in a “splice compatible” way) are expected to come from RNA sequenced reads. Instead, intergenic alignments (IGC) mainly come from DNA sequenced reads. For this reason, we see a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments: higher gDNA contamination levels lead to more IGC alignments and less SCJ or SCE alignments.

On the contrary, intronic (INT) alignments are positively correlated with IGC alignments and, thus, with gDNA contamination levels.

The last plot shows the strandedness value. In stranded RNA-seq protocols, we expect a strandedness value close to 1, meaning that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination: reads sequenced from DNA are expected to align in equal proportions to both strands. Therefore, a low strandedness in a stranded RNA-seq experiment can be due to the presence of DNA reads (contamination) mapping to transcripts but in the opposite strand.

Another plot to represent diagnostic measures is the one representing the origin of alignments per sample. Fluctuations in this proportions evidence different levels of gDNA contamination in samples.

plotAlnOrigins(gdnax, group=pdat$gDNA)