Contents

1 Introduction

strandCheckR is a package used a window that slides across a bam file and gets the count/coverage of reads mapping to +/- strand in each window. The returned Data Frame gives information across the whole genome through each sliding window. That helps to quantify the amount of reads coming from double strand genomic DNA. It can be applied to detect DNA contamination within a strand specifc RNA-seq by assessing whether each window contains reads mainly from one strand, as would be expected from RNA molecules, or from both strands as would be expected from contaminating DNA. Hence, it is considered as an additional quality checking for strand specific RNA data. The package can also be used to detect regions with specific behavior (e.g.  highest number of reads) through the whole genome and not only feature regions.

The window read count function is designed flexibly so that user can filter low mapping quality reads, set read proportion to overlap a window, set window length & step etc. It was also implemented in an efficient way to manange big bam file. For a typical human RNA-seq bam file, it takes about 3 minutes to scan and get strand information using a standard laptop 2,3 GHz i5 16 GB.

A function to filter out reads coming from double strand DNA is also provided. For each sliding window, it considers the proportion of +/- stranded reads comparing to a given threshold, to decide if a window contains reads derived from single strand RNA or double stranded DNA. A read that does not belong to any window coming from RNA will be removed. A read that belongs to some windows coming from RNA will be kept with a probability caluclated based on the proportion of +/- stranded of those windows.

2 Get windows information

The function getStrandFromBamFile is used to get the number of +/- stranded reads from all sliding windows throughout a list of bam files. The bam files should be sorted and have their index files located at the same path as well.

library(strandCheckR)
files <- system.file(
    "extdata",c("s1.sorted.bam","s2.sorted.bam"),package = "strandCheckR"
    )
win <- getStrandFromBamFile(files, sequences = "10")
# shorten the file name
win$File <- basename(as.character(win$File))
win
## DataFrame with 3078 rows and 10 columns
##       Type   Seq     Start       End NbPos NbNeg CovPos CovNeg MaxCoverage
##      <Rle> <Rle> <numeric> <numeric> <Rle> <Rle>  <Rle>  <Rle>       <Rle>
## 1       SE    10   7696701   7697700     0    17      0    393          17
## 2       SE    10   7696801   7697800     0    17      0    393          17
## 3       SE    10   7696901   7697900     0    17      0    393          17
## 4       SE    10   7697001   7698000     0    17      0    393          17
## 5       SE    10   7697101   7698100     0    17      0    393          17
## ...    ...   ...       ...       ...   ...   ...    ...    ...         ...
## 3074    SE    10   7398501   7399500    46    34   2241   1668          13
## 3075    SE    10   7398601   7399600    46    34   2241   1668          13
## 3076    SE    10   7398701   7399700    41    32   2046   1568          13
## 3077    SE    10   7398801   7399800    48    31   2500   1681          25
## 3078    SE    10   7398901   7399900    52    35   2581   1728          25
##               File
##        <character>
## 1    s1.sorted.bam
## 2    s1.sorted.bam
## 3    s1.sorted.bam
## 4    s1.sorted.bam
## 5    s1.sorted.bam
## ...            ...
## 3074 s2.sorted.bam
## 3075 s2.sorted.bam
## 3076 s2.sorted.bam
## 3077 s2.sorted.bam
## 3078 s2.sorted.bam

The returned DataFrame has 10 columns that correspond to the read type (R1 or R2 or single end read), sequence names, starting & ending bases of the windows (by default the window length is 1000 and the window step is 100), number of positive & negative reads that overlap each window (NbPositive, NbNegative), number of positive & negative bases that overlap each window (CovPositive, CovNegative), the maximum coverage (MaxCoverage) and the file name (File). Windows that do not overlap with any read are not reported.

The windows with highest read counts, for example, can be obtained as follows.

head(win[order((win$NbPos+win$NbNeg),decreasing=TRUE),])
## DataFrame with 6 rows and 10 columns
##    Type   Seq     Start       End NbPos NbNeg CovPos CovNeg MaxCoverage
##   <Rle> <Rle> <numeric> <numeric> <Rle> <Rle>  <Rle>  <Rle>       <Rle>
## 1    SE    10   7037801   7038800    17   817    838  40705         695
## 2    SE    10   7037901   7038900    13   816    653  40684         695
## 3    SE    10   7038001   7039000     4   810    180  40378         695
## 4    SE    10   7038101   7039100     1   805     50  40107         695
## 5    SE    10   7038201   7039200     1   804     50  40056         695
## 6    SE    10   7038301   7039300     1   801     50  39908         695
##            File
##     <character>
## 1 s2.sorted.bam
## 2 s2.sorted.bam
## 3 s2.sorted.bam
## 4 s2.sorted.bam
## 5 s2.sorted.bam
## 6 s2.sorted.bam

Here is an example for paired end bam file.

fileP <- system.file("extdata","paired.bam",package = "strandCheckR")
winP <- getStrandFromBamFile(fileP, sequences = "10")
winP$File <- basename(as.character(winP$File)) #shorten file name
winP
## DataFrame with 1184 rows and 10 columns
##       Type   Seq     Start       End NbPos NbNeg CovPos CovNeg MaxCoverage
##      <Rle> <Rle> <numeric> <numeric> <Rle> <Rle>  <Rle>  <Rle>       <Rle>
## 1       R1    10   7699501   7700500     0     1      0     49           1
## 2       R1    10   7699601   7700600     0     1      0     67           1
## 3       R1    10   7699701   7700700     0     1      0     67           1
## 4       R1    10   7699801   7700800     0     1      0     67           1
## 5       R1    10   7699901   7700900     0     1      0     67           1
## ...    ...   ...       ...       ...   ...   ...    ...    ...         ...
## 1180    R2    10   7758601   7759600    10     0    989    202           4
## 1181    R2    10   7758701   7759700     9     0    911    202           4
## 1182    R2    10   7758801   7759800     9     0    905    202           4
## 1183    R2    10   7758901   7759900     7     0   1099     51          32
## 1184    R2    10   7759001   7760000    39     0   3032      0          32
##             File
##      <character>
## 1     paired.bam
## 2     paired.bam
## 3     paired.bam
## 4     paired.bam
## 5     paired.bam
## ...          ...
## 1180  paired.bam
## 1181  paired.bam
## 1182  paired.bam
## 1183  paired.bam
## 1184  paired.bam

3 Intersect with an annotation GRanges object

If you have an annotation data, you can integrate it with the sliding windows obtained from the previous step using the function intersectWithFeature. The annotation must be a GRanges object, with or without mcols. Make sure that the sequence names in windows and annotation data are consistent. By default, you will have an additional column in the returned windows which indicates wheather a window overlap with any feature in the annotation object. You can also get details of the overlapped features in the mcols of the annotation object by specifying it in the mcolsAnnnot parameter.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
annot <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
#add chr before the sequence names to be consistent with the annot data
win$Seq <- paste0("chr",win$Seq) 
win <- intersectWithFeature(
    windows = win, annotation = annot, overlapCol = "OverlapTranscript"
    )
win
## DataFrame with 3078 rows and 11 columns
##       Type         Seq     Start       End NbPos NbNeg CovPos CovNeg
##      <Rle> <character> <numeric> <numeric> <Rle> <Rle>  <Rle>  <Rle>
## 1       SE       chr10   7696701   7697700     0    17      0    393
## 2       SE       chr10   7696801   7697800     0    17      0    393
## 3       SE       chr10   7696901   7697900     0    17      0    393
## 4       SE       chr10   7697001   7698000     0    17      0    393
## 5       SE       chr10   7697101   7698100     0    17      0    393
## ...    ...         ...       ...       ...   ...   ...    ...    ...
## 3074    SE       chr10   7398501   7399500    46    34   2241   1668
## 3075    SE       chr10   7398601   7399600    46    34   2241   1668
## 3076    SE       chr10   7398701   7399700    41    32   2046   1568
## 3077    SE       chr10   7398801   7399800    48    31   2500   1681
## 3078    SE       chr10   7398901   7399900    52    35   2581   1728
##      MaxCoverage          File OverlapTranscript
##            <Rle>   <character>       <character>
## 1             17 s1.sorted.bam        NotOverlap
## 2             17 s1.sorted.bam        NotOverlap
## 3             17 s1.sorted.bam        NotOverlap
## 4             17 s1.sorted.bam        NotOverlap
## 5             17 s1.sorted.bam        NotOverlap
## ...          ...           ...               ...
## 3074          13 s2.sorted.bam           Overlap
## 3075          13 s2.sorted.bam           Overlap
## 3076          13 s2.sorted.bam           Overlap
## 3077          25 s2.sorted.bam           Overlap
## 3078          25 s2.sorted.bam           Overlap

If you have a gtf or gff file, you can use the function gffReadGR of the ballgown package to read it as a GRanges object.

4 Plot histogram and windows information

With these windows, you can have some plots via plotHist and plotWin functions which can be saved to an appropriate location.

To plot the histogram of positive proportion of the sliding windows, we use the function plotHist. This function will first calculates the frequency of positive proportion over all windows, and also group/normalize them based on given column names. Then it uses the geom_bar function of ggplot2 to plot those frequencies. The plot gives you an idea on how much double-strand DNA is contained in your sample. In perfectly clear ss-RNA-seq, the positive proportion of every window should be either around 0% or 100%. The more amount of windows have this proportion around 50%, the more the sample was contaminated by DNA.

plotHist(
    windows = win, groupBy = c("File","OverlapTranscript"), 
    normalizeBy = "File", scales = "free_y"
    )

In this example, file s2.sorted.bam seems to be contaminated with double strand DNA, while file s1.sorted.bam is cleaner. By default, the windows are splitted in to 4 coverage groups <10, 10-100, 100-1000, and >1000. It can be set via option split.

For paired-end data, we can group the Data Frame by read type:

plotHist(
    windows = winP, groupBy = "Type", normalizeBy = "Type", scales = "free_y"
    )

Heatmap can be used instead of classic barplot for histogram when specifying heatmap=TRUE. This would be usefull to visualize mutliple files in the same plot.

plotHist(
    windows = win, groupBy = c("File","OverlapTranscript"), 
    normalizeBy = "File", heatmap = TRUE
    )

plotWin creates a plot on the number of reads vs positive proportion of each window. There are also 4 lines correspond to 4 representative thresholds (0.6, 0.7, 0.8, 0.9). Threshold is a parameter that is used when filtering a bam file using filterDNA. Given a threshold, a positive (resp. negative) window is kept if and only if it is above (resp. below) the corresponding threshold line on this plot. This can be used to give guidance as to the best threshold to choose when filtering windows.

plotWin(win, groupBy = "File")

5 Filter bam files

The functions filterDNA removes potential double strand DNA from a bam file using a given threshold.

win2 <- filterDNA(
    file = files[2], sequences = "10", destination = "s2.filter.bam", 
    threshold = 0.7, getWin = TRUE
    )

Other parameters can be specified for more flexible filtering: define the ranges that you want to always keep, the minimum number of reads under which you want to ignore, the pvalue threshold for keeping a windows etc. Look at the manual page of the function for more options and explanations.

Plotting the windows before and after filering:

win2$File <- basename(as.character(win2$File))
win2$File <- factor(win2$File,levels = c("s2.sorted.bam","s2.filter.bam"))
library(ggplot2)
plotHist(win2,groupBy = "File",normalizeBy = "File",scales = "free_y") +
    ggtitle("Histogram of positive proportions over all sliding windows before 
            and after filtering reads coming from double strand DNA")