Detection of 2’-O methylations by RiboMethSeq
Among the various post-transcriptional RNA modifications, 2’-O methylations are commonly found in rRNA and tRNA. They promote the endo conformation of the ribose and confere resistance to alkaline degradation by preventing a nucleophilic attack on the 3’-phosphate especially in flexible RNA, which is fascilitated by high pH conditions. This property can be queried using a method called RiboMethSeq (Birkedal et al. 2015) for which RNA is treated in alkaline conditions and RNA fragments are used to prepare a sequencing library (Marchand et al. 2017).
At position containing a 2’-O methylations, read ends are less frequent, which is used to detect and score the 2’-O methylations.
The ModRiboMethSeq
class uses the the ProtectedEndSequenceData
class to
store and aggregate data along the transcripts. The calculated scores follow the
nomenclature of (Birkedal et al. 2015; Galvanin et al. 2019) with the names
scoreRMS
(default), scoreA
, scoreB
and scoreMean
.
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationForge'
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'ExperimentHubData'
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.RiboMethSeq)
library(RNAmodR.Data)
The example workflow is limited to two 2’-O methylated position on 5.8S rRNA,
since the size of the raw data is limited. For annotation data either a gff file
or a TxDb
object and for sequence data a fasta file or a BSgenome
object can
be used. The data is provided as bam files.
annotation <- GFF3File(RNAmodR.Data.example.RMS.gff3())
sequences <- RNAmodR.Data.example.RMS.fasta()
files <- list("Sample1" = c(treated = RNAmodR.Data.example.RMS.1()),
"Sample2" = c(treated = RNAmodR.Data.example.RMS.2()))
The analysis is triggered by the construction of a ModSetRiboMethSeq
object.
Internally parallelization is used via the BiocParallel
package, which would
allow optimization depending on number/size of input files (number of samples,
number of replicates, number of transcripts, etc).
msrms <- ModSetRiboMethSeq(files, annotation = annotation, sequences = sequences)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
msrms
## ModSetRiboMethSeq of length 2
## names(2): Sample1 Sample2
## | Modification type(s): Am / Cm / Gm / Um
## Sample1 Sample2
## | Modifications found: no yes (1)
## | Settings:
## minCoverage minReplicate find.mod maxLength minSignal flankingRegion
## <integer> <integer> <logical> <integer> <integer> <integer>
## Sample1 10 1 TRUE 50 10 6
## Sample2 10 1 TRUE 50 10 6
## minScoreA minScoreB minScoreRMS minScoreMean flankingRegionMean
## <numeric> <numeric> <numeric> <numeric> <integer>
## Sample1 0.6 3.6 0.75 0.75 2
## Sample2 0.6 3.6 0.75 0.75 2
## weights
## <NumericList>
## Sample1 0.9,1.0,0.0,...
## Sample2 0.9,1.0,0.0,...
## scoreOperator
## <character>
## Sample1 &
## Sample2 &
To compare samples, we need to know, which positions should be part of the comparison. This can either be done by aggregating the detect over all samples and use the union or intersect or by using publish data. We want to assemble a GRanges object from the latter by utilising the infomation from the snoRNAdb (Lestrade and Weber 2006).
In this specific example only information for the 5.8S RNA is used, since the
example data would be to big otherwise. The information regarding the parent and
seqname must match the information used as the annotation data. Check that it
matches the output of ranges()
on a SequenceData
, Modifier
oder
ModifierSet
object.
table <- read.csv2(RNAmodR.Data.snoRNAdb(), stringsAsFactors = FALSE)
## see ?RNAmodR.Data and browseVignettes('RNAmodR.Data') for documentation
## loading from cache
table <- table[table$hgnc_id == "53533",] # Subset to RNA5.8S
# keep only the current coordinates
table <- table[,1L:7L]
snoRNAdb <- GRanges(seqnames = "chr1",
ranges = IRanges(start = table$position,
width = 1),
strand = "+",
type = "RNAMOD",
mod = table$modification,
Parent = "1", #this is the transcript id
Activity = IRanges::CharacterList(strsplit(table$guide,",")))
coord <- split(snoRNAdb,snoRNAdb$Parent)
In addition to the coordinates of published, we also want to include more
meaningful names for the transcripts. For this we provide a data.frame
with
two columns, tx_id
and name
. All values in the first column have to match
transcript IDs.
ranges(msrms)
## GRangesList object of length 1:
## $`1`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 1-157 + | 1 NR_003285.2 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
alias <- data.frame(tx_id = "1", name = "5.8S rRNA", stringsAsFactors = FALSE)
plotCompareByCoord(msrms[c(2L,1L)], coord, alias = alias)
Results can also be compared on a sequence level, by selecting specific coordinates to compare.
singleCoord <- coord[[1L]][1L,]
plotDataByCoord(msrms, singleCoord)
By default only the RiboMethSeq score and the ScoreMean are shown. The raw sequence data can be inspected as well
singleCoord <- coord[[1L]][1L,]
plotDataByCoord(msrms, singleCoord, showSequenceData = TRUE)