wavClusteR 2.41.0
#Abstract
A number of recently developed next-generation sequencing based methods (e.g. PAR-CLIP, Bisulphite sequencing, RRBS) specifically induce nucleotide substitutions within the short reads with respect to the reference genome. wavClusteR provides functions for the analysis of the data obtained by such methods with a major focus on PAR-CLIP. The package leverages on experimentally induced substitutions to identify high confidence signals (e.g. RNA-binding sites) in the data. A wavClusteR workflow consists of two steps:
The package supports multicore computing. For a detailed description of the method please refer to Sievers et al., 2012; Comoglio et al., 2015.
#Preparing the input
wavClusteR expects a sorted and indexed BAM file as input. A streamlined workflow to generate the required input from a fastq file is outlined below (line 1 is pseudo code, replace it with the aligner specific syntax).
#ALIGN:
sample.fastq -> sample.sam
#CONVERT:
samtools view -b -S sample.sam -o sample.bam
#SORT:
samtools sort sample.bam sample_sorted
#INDEXING:
samtools index sample_sorted.bam
samtools view
from SAMtoolssamtools sort
samtools index
.##Example dataset
wavClusteR provides a chunk of a human Argonaute 2 (AGO2) PAR-CLIP data set from Kishore et al., 2011. Reads in the chunk map to the genomic interval chrX:23996166-24023263. This data set is used below to illustrate a workflow for PAR-CLIP data analysis.
#A workflow for PAR-CLIP data analysis
##Reading sorted BAM files
A sorted and indexed BAM file can be loaded into the R session with readSortedBam. This function extracts aligned reads, sequences and the mismatch MD field, and returns a GRanges object.
library(wavClusteR)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Rsamtools
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
Bam <- readSortedBam(filename = filename)
Bam
## GRanges object with 5358 ranges and 2 metadata columns:
## seqnames ranges strand | qseq
## <Rle> <IRanges> <Rle> | <DNAStringSet>
## [1] chrX 24001819-24001844 - | CAGAGATAAA...TATTTTAAAG
## [2] chrX 24001819-24001843 - | CAGAGATAAA...ATATTTTAAG
## [3] chrX 24001834-24001863 - | ATATTTTAGA...ATTTTATTTA
## [4] chrX 24001836-24001865 - | TTTTTAAAGA...TTTATTTAAA
## [5] chrX 24001841-24001876 - | AAAGATTAAA...TTTTCTTCAT
## ... ... ... ... . ...
## [5354] chrX 24023018-24023051 - | GTTTCACAGC...AAAAATATGT
## [5355] chrX 24023018-24023051 - | GTTTCACAGC...AAAAATATGT
## [5356] chrX 24023019-24023051 - | TTTCACAGCG...AAAAATATGT
## [5357] chrX 24023019-24023051 - | TTTCACAGCG...AAAAATATGT
## [5358] chrX 24023067-24023090 - | CAAAGGCGCG...GGTTTATTTT
## tag.MD
## <character>
## [1] 26
## [2] 24A0
## [3] 8A21
## [4] 0A13A15
## [5] 24A11
## ... ...
## [5354] 10A23
## [5355] 10A23
## [5356] 9A23
## [5357] 9A23
## [5358] 9A14
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
##Extracting informative genomic positions
Prior to model fitting, genome-wide substitutions need to be identified and filtered based on a coverage threshold. The getAllSub function identifies all genomic positions exhibiting at least one substitution and coverage above this threshold.
countTable <- getAllSub( Bam, minCov = 10 )
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Considering substitutions, n = 497, processing in 1 chunks
## chunk #: 1
## considering the + strand
## Computing local coverage at substitutions...
## considering the - strand
## Computing local coverage at substitutions...
head( countTable )
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | substitutions coverage count
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] chrX 24001959 - | TC 17 2
## [2] chrX 24001973 - | TC 17 12
## [3] chrX 24001977 - | TC 13 1
## [4] chrX 24002046 - | TC 10 1
## [5] chrX 24002057 - | TC 10 6
## [6] chrX 24002147 - | TC 22 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The function returns a GRanges object specifying genomic position, strand, substitution type (e.g. “TC”: T in the reference genome; C in the read), strand-specific coverage and number of observed substitutions at the position.
###How to choose the coverage threshold?
The coverage threshold minCov
defines the genomic positions to be used for parameter estimation, thus providing a means to tune the stringency of the analysis. Currently, wavClusteR does not allow to learn this threshold from the data. However, since the model is based on relative substitution frequencies, the value of minCov
will influence the variance of the estimated parameters. Therefore, values smaller than default (minCov=10
) are not recommended.
##Inspecting the substitutions profile (diagnostic I)
Once all substitutions are computed, a diagnostic substitution profile can be plotted with plotSubstitutions.
plotSubstitutions( countTable, highlight = "TC" )
The function returns a barplot showing the total number of genomic positions that exhibit a given type of substitution and highlights the substitution type that is expected to be generated by the experimental procedure. The percentage of substitution of this type is also shown. This plot can be readily used to assess the quality of the sequenced libraries and can be used to compare different data sets generated under the same experimental condition.
##Estimating the model
wavClusteR uses the identified genomic positions to learn a non-parametric mixture model discriminating PAR-CLIP-specific from extrinsic transitions. Model parameters are estimated by fitMixtureModel.
model <- fitMixtureModel(countTable, substitution = "TC")
The function returns a list of:
As the small size of our example data set does not to estimate the model reliably, the mixture model for the entire AGO2 dataset has been precomputed and is provided by the package.
data(model)
str(model)
## List of 5
## $ l1: Named num 0.181
## ..- attr(*, "names")= chr "TC"
## $ l2: Named num 0.819
## ..- attr(*, "names")= chr "TC"
## $ p : num [1:999] 7.52 9.44 10.05 10.38 10.48 ...
## $ p1: num [1:999] 89.6 64.4 50.4 41.5 35.3 ...
## $ p2: num [1:999] 0 0 1.14 3.51 5 ...
##Inspecting the model fit (diagnostic II)
The model fit can be inspected using getExpInterval.
(support <- getExpInterval( model, bayes = TRUE ) )
## $supportStart
## [1] 0.007
##
## $supportEnd
## [1] 0.98
The function returns two diagnostic plots. The first plot illustrates the estimated densities \(p\), \(p_1\) and \(p_2\), and log odds
\[ o=
The second plot shows the resulting posterior class probability, i.e. the probability that a given relative substitution frequency (RSF, horizontal axis) has been induced by PAR-CLIP. The area under the curve corresponding to the returned RSF interval is colored, and the RSF interval indicated. By default, getExpInterval returns the RSF interval according to the Bayes classifier, i.e. a posterior probability cutoff of 0.5. However, the user can specify a custom RSF interval:
By means of the rightProb and leftProb parameters, e.g.
(support <- getExpInterval( model, bayes = FALSE, leftProb = 0.9, rightProb = 0.9 ) )
## $supportStart
## [1] 0.076
##
## $supportEnd
## [1] 0.905
By inspecting the posterior class probability density and passing the RSF interval boundaries when calling high-confidence substitutions (see point 6).
Finally, the model can be used to produce further diagnostic plots. Particularly, the total number of reads exhibitng a given substitution and an RSF-based partitioning of genomic positions with substitutions can be obtained by
plotSubstitutions( countTable, highlight = "TC", model )