srnadiff 1.6.0
srnadiff is an R package that finds differentially expressed regions from RNA-seq data at base-resolution level without relying on existing annotation. To do so, the package implements the identify-then-annotate methodology, which combines two pipelines: detection of differentially expressed regions and quantification of differential expression.
There is no established method for finding differentially expressed short RNAs. The most widely used approach focuses on miRNAs and simply applies a standard RNA-Seq pipeline to these genes.
As a consequence, annotated tRFs, siRNAs, piRNAs, etc. fall outside the scope of these analyses. Several ad hoc methods have been used, and this package implements a unifying method that finds differentially expressed genes or regions of any kind.
The srnadiff package implements two major methods to produce potential differentially expressed regions: the HMM and IR methods. Briefly, these methods identify contiguous base pairs in the genome that present a differential expression signal; these are then grouped into genomic intervals called differentially expressed regions (DERs).
Once DERs are detected, the second step in an sRNA-diff approach is to quantify their statistical significance. To do so, reads (including fractions of reads) that overlap each expressed region are counted to arrive at a count matrix with one row per region and one column per sample. Then, this count matrix is analyzed using the standard DESeq2 workflow for differential expression of RNA-seq data, assigning a p-value to each candidate DER.
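To make the counting step concrete, here is a minimal base R sketch of how a count matrix with one row per region and one column per sample can be built. The regions, the reads and the helper countReadsPerRegion are hypothetical and only serve the illustration; this is not the package's internal code. A read is counted as soon as it overlaps a region by at least one base.

```r
# Hypothetical candidate regions on a single chromosome (1-based, inclusive).
regions <- data.frame(start = c(100, 500), end = c(200, 650))

# Hypothetical read alignments for two samples.
reads <- list(
  ctr_1 = data.frame(start = c(90, 150, 510, 900), end = c(120, 180, 540, 930)),
  inf_1 = data.frame(start = c(160, 640), end = c(190, 670))
)

# Hypothetical helper: for one sample, count the reads that overlap each
# region by at least one base (partial overlaps are counted too).
countReadsPerRegion <- function(reads, regions) {
  vapply(seq_len(nrow(regions)), function(i) {
    sum(reads$start <= regions$end[i] & reads$end >= regions$start[i])
  }, numeric(1))
}

# One row per region, one column per sample.
countMatrix <- sapply(reads, countReadsPerRegion, regions = regions)
rownames(countMatrix) <- paste0("region_", seq_len(nrow(regions)))
countMatrix
```

A matrix of this shape is what a count-based differential expression workflow such as DESeq2 takes as input.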
The main functions for finding differentially expressed regions are srnadiffExp and srnadiff. The first one creates an S4 class providing the infrastructure (slots) to store the input data, method parameters, intermediate calculations and results of an sRNA-diff approach. The second one implements four methods to find candidate differentially expressed regions and quantifies the statistical significance of the regions found.
This vignette explains the basics of using srnadiff by showing an example, including advanced material for fine tuning some options. The vignette also includes description of the methods behind the package.
We hope that srnadiff will be useful for your research. Please use the following information to cite srnadiff and the overall approach when you publish results obtained using this package, as such citation is the main means by which the authors receive credit for their work. Thank you!
Zytnicki, M., and I. González. (2019). “srnadiff: Finding differentially expressed unannotated genomic regions from RNA-seq data.” R package version 1.6.0.
Most questions about individual functions will hopefully be answered by the documentation. To get more information on any specific named function, for example srnadiff, you can bring up the documentation by typing at the R prompt
help("srnadiff")
or
?srnadiff
The authors of srnadiff always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. If you’ve run into a question that isn’t addressed by the documentation, or you’ve found a conflict between the documentation and what the software does, then there is an active community that can offer help. Send your questions or problems concerning srnadiff to the Bioconductor support site.
Please send requests for general assistance and advice to the support site, rather than to the individual authors. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. Users posting to the support site for the first time will find it helpful to read the posting guide at the Bioconductor help page.
A typical sRNA-diff session can be divided into three steps:
1. An object of class srnadiffExp is created containing all the information required for the two remaining steps. The user needs to provide a vector with the full paths to the BAM files, a data.frame with sample and experimental design information and optionally annotated regions as a GRanges object.
2. The function srnadiff is called to find potential DERs and quantify their statistical significance.
3. The results are visualized with plotRegions.
A typical srnadiff session might look like the following. Here we assume that bamFiles is a vector with the full paths to the BAM files and that the sample and experimental design information is stored in a data frame sampleInfo.
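A minimal sketch of such a session, using only functions described in this vignette. The wrapper runSession is a hypothetical name introduced here so that the three steps can be read in order; it assumes the srnadiff package is installed and that bamFiles and sampleInfo are supplied by the user.

```r
# Hypothetical wrapper chaining the three steps of a session.
runSession <- function(bamFiles, sampleInfo) {
  # Step 1: gather the input data in an srnadiffExp object.
  srnaExp <- srnadiff::srnadiffExp(bamFiles, sampleInfo)
  # Step 2: detect candidate DERs and quantify their significance.
  srnaExp <- srnadiff::srnadiff(srnaExp)
  # Step 3: plot the coverage around the first region found.
  srnadiff::plotRegions(srnaExp, srnadiff::regions(srnaExp)[1])
  # Return the analyzed object for further inspection.
  invisible(srnaExp)
}
```

With real inputs, the call would be srnaExp <- runSession(bamFiles, sampleInfo).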
We assume that the user has the R program (see the R project) already installed.
The srnadiff package is available from the Bioconductor repository. To be able to install the package one needs first to install the core Bioconductor packages. If you have already installed Bioconductor packages on your system then you can skip the two lines below.
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
Once the core Bioconductor packages are installed, you can install the srnadiff package by
BiocManager::install("srnadiff", version="3.8")
Load the srnadiff package in your R session:
library(srnadiff)
A list of all accessible vignettes and methods is available with the following command:
help.search("srnadiff")
To help demonstrate the functionality of srnadiff, the package includes datasets published by Viollet et al. (2015).
Briefly, these data consist of three replicates of sRNA-Seq of SLK (human) cell lines, and three replicates of SLK cell lines infected with Kaposi’s sarcoma associated herpesvirus. The analysis shows that several loci are repressed in the infected cell lines, including the 14q32 miRNA cluster.
Raw data have been downloaded from the GEO data set GSE62830. Adapters were removed with fastx_clipper and the reads were mapped (Salzberg and Langmead 2012) on the human genome version GRCh38.
This data is restricted to a small locus on chr14. It uses the whole genome annotation (with coding genes, etc.) and extracts miRNAs.
The file dataInfo.csv contains three columns for each BAM file:
the file name (FileName)
the sample name (SampleName)
the biological condition, e.g. WT (Condition)
srnadiffExp object
The first step in an sRNA-diff approach is to create an srnadiffExp object. srnadiffExp is an S4 class providing the infrastructure to store the input data, method parameters, intermediate calculations and results of an sRNA-diff approach. This object will also be the input of the visualization function.
The srnadiffExp object will usually be represented in the code here as srnaExp. To build such an object the user needs the following:
a vector with the full paths to the BAM files;
a data.frame with three columns labelled FileName, SampleName and Condition. The first column is the BAM file name (without extension), the second column the sample name, and the third column the condition to which the sample belongs. Each row describes one sample.
Here, we demonstrate how to construct an srnadiffExp object from the data of Viollet et al. (2015).
## Determine the path to data files
basedir <- system.file("extdata", package="srnadiff", mustWork=TRUE)
## Reads sample information file and creates a data frame from it
sampleInfo <- read.csv(file.path(basedir, "dataInfo.csv"))
## Vector with the full paths to the BAM files to use
bamFiles <- paste(file.path(basedir, sampleInfo$FileName), "bam", sep=".")
## Creates an srnadiffExp object
srnaExp <- srnadiffExp(bamFiles, sampleInfo)
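For your own data, the sample sheet does not have to come from a CSV file. The following base R sketch builds an equivalent data frame by hand and derives the BAM paths from it; the file names, sample names and the directory path/to/bam are hypothetical placeholders.

```r
# Hypothetical sample sheet; file names are given without the ".bam" extension.
sampleInfo <- data.frame(
  FileName   = c("ctrA", "ctrB", "infA", "infB"),
  SampleName = c("ctr_1", "ctr_2", "inf_1", "inf_2"),
  Condition  = c("control", "control", "infected", "infected"),
  stringsAsFactors = FALSE
)

# Check that the three expected columns are present.
stopifnot(all(c("FileName", "SampleName", "Condition") %in% colnames(sampleInfo)))

# Hypothetical directory holding the BAM files.
bamDir <- file.path("path", "to", "bam")

# Full paths to the BAM files, one per row of the sample sheet.
bamFiles <- paste(file.path(bamDir, sampleInfo$FileName), "bam", sep = ".")
bamFiles
```

Checking the column names up front gives a clearer error than a failure later in the analysis.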
Optionally, if annotation information is available as a GRanges object, annotReg say, then an srnadiffExp object can be created by
srnaExp <- srnadiffExp(bamFiles, sampleInfo, annotReg)
or by
srnaExp <- srnadiffExp(bamFiles, sampleInfo)
annotReg(srnaExp) <- annotReg
A summary of the srnaExp object can be seen by typing the object name at the R prompt
srnaExp
## Object of class srnadiffExp.
## Sample information
## FileName SampleName Condition
## 1 SRR1634756 ctr_1 control
## 2 SRR1634757 ctr_2 control
## 3 SRR1634758 ctr_3 control
## 4 SRR1634759 inf_1 infected
## 5 SRR1634760 inf_2 infected
## 6 SRR1634761 inf_3 infected
For your convenience and for illustrative purposes, an example of an srnadiffExp object can be loaded with a single command, so the script boils down to:
srnaExp <- srnadiffExample()
The srnadiffExp object in this example was constructed by:
basedir <- system.file("extdata", package="srnadiff", mustWork=TRUE)
sampleInfo <- read.csv(file.path(basedir, "dataInfo.csv"))
gtfFile <- file.path(basedir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotReg <- readAnnotation(gtfFile, feature="gene", source="miRNA")
bamFiles <- paste(file.path(basedir, sampleInfo$FileName), "bam", sep=".")
srnaExp <- srnadiffExp(bamFiles, sampleInfo, annotReg)
srnadiff offers the readAnnotation function for loading annotation data. It accepts two annotation file formats: GTF and GFF. The specification of the GTF/GFF formats can be found at the UCSC dedicated page.
readAnnotation reads and parses the content of GTF/GFF files and stores the annotated genomic features (regions) in a GRanges object. It has three main arguments: the first argument indicates the path, URL or connection to the GTF/GFF annotation file. The second and third arguments, feature and source respectively, are character strings that specify the feature and source types used to select the rows of the GTF/GFF annotation to import. feature and source can be NULL; in this case, no selection is performed and the whole content of the file is imported.
This method simply provides the genomic regions corresponding to the annotation file that is optionally given by the user. It can be a set of known miRNAs, siRNAs, piRNAs, genes, or a combination thereof.
This GTF file can be found in the central repositories (NCBI, Ensembl) and contains all the annotation found in an organism (coding genes, transposable elements, etc.). The following function reads the annotation file and extracts the miRNAs. Annotation files may have different formats, but this command has been tested on several model organisms (including human) from Ensembl.
gtfFile <- file.path(basedir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotReg <- readAnnotation(gtfFile, feature="gene", source="miRNA")
miRBase (Kozomara and Griffiths-Jones 2014) is the central repository for miRNAs. If your organism is available, you can download their miRNA annotation in GFF3 format (check the “Browse” tab). The following code parses a GFF3 miRBase file, and extracts the precursor miRNAs.
gffFile <- file.path(basedir, "mirbase21_GRCh38.gff3")
annotReg <- readAnnotation(gffFile, feature="miRNA_primary_transcript")
In the previous example, the reads will be counted per pre-miRNA, and the 5’ and 3’ arms, the miRNA and the miRNA* will be merged in the same feature. If you want to separate the two, use:
gffFile <- file.path(basedir, "mirbase21_GRCh38.gff3")
annotReg <- readAnnotation(gffFile, feature="miRNA")
When the previous functions do not work, you can use your own parser with:
annotation <- readAnnotation(gtfFile, source="miRNA", feature="gene")
The source parameter keeps all the lines whose second field matches the given parameter (e.g. miRNA).
The feature parameter keeps all the lines whose third field matches the given parameter (e.g. gene).
The name of the feature will be given by the tag name (e.g. gene_name).
source, feature and name can be NULL. In this case, no selection is performed on source or feature. If name is NULL, then a systematic name is given (annotation_N).
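The selection rules above can be illustrated in base R on a few GTF lines. This is only a sketch of the filtering logic (second field for source, third field for feature, a tag from the ninth field for the name), not the code of readAnnotation; the three GTF lines and all object names are hypothetical.

```r
# Three hypothetical GTF lines (nine tab-separated fields each).
gtfLines <- c(
  "14\tmiRNA\tgene\t100\t180\t.\t+\t.\tgene_id \"G1\"; gene_name \"MIR_A\";",
  "14\tmiRNA\texon\t100\t180\t.\t+\t.\tgene_id \"G1\"; gene_name \"MIR_A\";",
  "14\tprotein_coding\tgene\t500\t900\t.\t-\t.\tgene_id \"G2\"; gene_name \"GENE_B\";"
)

# Parse the lines into a data frame, keeping the quotes of the last field.
gtf <- read.delim(text = gtfLines, header = FALSE, quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqname", "source", "feature", "start", "end",
                                "score", "strand", "frame", "attributes"))

# Keep the lines whose second field is "miRNA" and third field is "gene".
selected <- gtf[gtf$source == "miRNA" & gtf$feature == "gene", ]

# Extract the value of the gene_name tag from the attributes field.
geneName <- sub('.*gene_name "([^"]*)".*', "\\1", selected$attributes)
geneName
```

Only the first line passes both filters, and its name is taken from the gene_name tag.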
The main function for performing an sRNA-diff analysis is srnadiff, which is a wrapper for running several key functions from this package. srnadiff implements four methods to produce potential DERs: the annotation, naive, hmm and IR methods (see below).
Once potential DERs are detected, the second step in srnadiff is to quantify their statistical significance.
srnadiff has three main arguments. The first argument is an instance of class srnadiffExp. The second argument is a character vector that specifies the segmentation methods to use: one of annotation, naive, hmm, IR, or combinations thereof. The default is all, in which case all methods are used. The third argument is a list containing named components for the method parameters to use. If missing, default parameter values are supplied. Details about the method parameters are further described in the manual page of the parameters function and in the section Methods to produce differentially expressed regions.
We then perform an sRNA-diff analysis on the input data contained in srnaExp by
srnaExp <- srnadiff(srnaExp)
srnadiff returns an object of class srnadiffExp again, containing additional slots for:
regions
parameters
countMatrix
srnadiffExp object
Once the srnadiffExp object is created, the user can use the methods defined for this class to access the information encapsulated in the object.
For example, the sample information is accessed by
sampleInfo(srnaExp)
## FileName SampleName Condition
## 1 SRR1634756 ctr_1 control
## 2 SRR1634757 ctr_2 control
## 3 SRR1634758 ctr_3 control
## 4 SRR1634759 inf_1 infected
## 5 SRR1634760 inf_2 infected
## 6 SRR1634761 inf_3 infected
To access the chromosomeSize slot, use
chromosomeSizes(srnaExp)
## 14
## 107043718
The list of parameters can be exported with the function parameters:
parameters(srnaExp)
The regions, with corresponding information provided by DESeq2 (mean expression, fold-change, p-value, adjusted p-value, etc.), can be extracted with this command:
regions <- regions(srnaExp, pvalue=0.5)
where pvalue is the (adjusted) p-value threshold. The output is a GenomicRanges object, and the information is accessible with the mcols() function.
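As an illustration of what thresholding on an adjusted p-value means, the base R sketch below adjusts a few hypothetical raw p-values with the Benjamini-Hochberg procedure (the default adjustment in DESeq2) and keeps those that pass a threshold. The values, the threshold and the object names are assumptions made for the example; whether regions() compares with a strict or non-strict inequality is not specified here.

```r
# Hypothetical raw p-values for five candidate regions.
pvalues <- c(0.001, 0.02, 0.03, 0.4, 0.9)

# Benjamini-Hochberg adjustment for multiple testing.
padj <- p.adjust(pvalues, method = "BH")

# Keep the regions whose adjusted p-value does not exceed the threshold.
threshold <- 0.1
kept <- which(padj <= threshold)
kept
```

The adjusted values are never smaller than the raw ones, so a region that passes the threshold after adjustment would also pass it before.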
An insightful way of looking at the results of srnadiff is to investigate how the coverage information surrounding the regions found is distributed along the genomic coordinates.
plotRegions provides a flexible genomic visualization framework by displaying tracks in the sense of the Gviz package. Given a region (or regions), four separate tracks are represented:
GenomeAxisTrack, a horizontal axis with genomic coordinate tickmarks for reference location to the displayed genomic regions;
GeneRegionTrack, if the annot argument is passed, a track displaying all gene and/or sRNA annotation information in a particular region;
AnnotationTrack, regions are plotted as simple boxes if no strand information is available, or as arrows to indicate their direction; and
DataTrack, plot the sample coverages surrounding the genomic regions.
The sample coverages can be plotted in various different forms as well as combinations thereof. Supported plotting types are:
p: simple dot plot;
l: lines plot;
b: combination of dot and lines plot;
a: lines plot of the sample-groups average (i.e., mean) values;
confint: confidence intervals for average values.
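To clarify what the a and confint types summarize, the base R sketch below computes the per-condition mean coverage at each position together with a conventional 95% t-based interval. The coverage values are hypothetical, and the interval shown is a textbook construction that is not necessarily the exact one drawn by the plotting code.

```r
# Hypothetical coverage at five positions (rows) for six samples (columns).
coverage <- cbind(
  ctr_1 = c(10, 12, 15, 11, 9), ctr_2 = c(11, 13, 14, 12, 10),
  ctr_3 = c(9, 11, 16, 10, 8),  inf_1 = c(3, 4, 5, 4, 2),
  inf_2 = c(4, 5, 6, 3, 3),     inf_3 = c(2, 3, 4, 5, 1)
)
condition <- c("control", "control", "control",
               "infected", "infected", "infected")

# Column indices of the samples belonging to each condition.
groups <- split(seq_along(condition), condition)

# Mean coverage per position within each condition.
groupMean <- sapply(groups, function(idx) {
  rowMeans(coverage[, idx, drop = FALSE])
})

# Half-width of a 95% t-based interval per position and condition.
halfWidth <- sapply(groups, function(idx) {
  x <- coverage[, idx, drop = FALSE]
  qt(0.975, df = length(idx) - 1) * apply(x, 1, sd) / sqrt(length(idx))
})

lower <- groupMean - halfWidth
upper <- groupMean + halfWidth
groupMean
```

With only three replicates per condition the intervals are wide, which is worth keeping in mind when reading such plots.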
The default visualization for results from srnadiff is a lines plot of the sample-groups average.
plotRegions(srnaExp, regions(srnaExp)[1])
Displayed below are the same sample data as before but plotted with the different type settings: