In this document, we show how to conduct Exploratory Data Analysis
(EDA) and normalization for a typical RNA-Seq experiment using the
One can think of EDA for RNA-Seq as a two-step process: “read-level” EDA helps in discovering lanes with low sequencing depths, quality issues, and unusual nucleotide frequencies, while ``gene-level’’ EDA can capture mislabeled lanes, issues with distributional assumptions (e.g., over-dispersion), and GC-content bias.
The package also implements both “within-lane” and “between-lane” normalization procedures, to account, respectively, for within-lane gene-specific (and possibly lane-specific) effects on read counts (e.g., related to gene length or GC-content) and for between-lane distributional differences in read counts (e.g., sequencing depths).
To illustrate the functionality of the
EDASeq package, we
make use of the Saccharomyces cerevisiae RNA-Seq data from
(Lee et al. 2008). Briefly, a wild-type strain and three mutant
strains were sequenced using the Solexa 1G Genome Analyzer. For each
strain, there are four technical replicate lanes from the same library
preparation. The reads were aligned using
(Langmead et al. 2009), with unique mapping and allowing up to
leeBamViews package provides a subset of the aligned
reads in BAM format. In particular, only the reads mapped between
bases 800,000 and 900,000 of chromosome XIII are considered. We use
these reads to illustrate read-level EDA.
yeastRNASeq package contains gene-level read counts for
four lanes: two replicates of the wild-type strain (“wt”) and
two replicates of one of the mutant strains (“mut”). We use
these data to illustrate gene-level EDA.
## Warning: replacing previous import 'utils::findMatches' by ## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
Unaligned (unmapped) reads stored in FASTQ format may be managed via
FastqFileList imported from
Information related to the libraries sequenced in each lane can be
stored in the
elementMetadata slot of the
files <- list.files(file.path(system.file(package = "yeastRNASeq"), "reads"), pattern = "fastq", full.names = TRUE) names(files) <- gsub("\\.fastq.*", "", basename(files)) met <- DataFrame(conditions=c(rep("mut",2), rep("wt",2)), row.names=names(files)) fastq <- FastqFileList(files) elementMetadata(fastq) <- met fastq
## FastqFileList of length 4 ## names(4): mut_1_f mut_2_f wt_1_f wt_2_f
The package can deal with aligned (mapped) reads in BAM format, using
Rsamtools. Again, the
elementMetadata slot can be used to store lane-level sample
files <- list.files(file.path(system.file(package = "leeBamViews"), "bam"), pattern = "bam$", full.names = TRUE) names(files) <- gsub("\\.bam", "", basename(files)) gt <- gsub(".*/", "", files) gt <- gsub("_.*", "", gt) lane <- gsub(".*(.)$", "\\1", gt) geno <- gsub(".$", "", gt) pd <- DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep=".")) bfs <- BamFileList(files) elementMetadata(bfs) <- pd bfs
## BamFileList of length 8 ## names(8): isowt5_13e isowt6_13e ... xrn1_13e xrn2_13e
One important check for quality control is to look at the total number of reads produced in each lane, the number and the percentage of reads mapped to a reference genome. A low total number of reads might be a symptom of low quality of the input RNA, while a low mapping percentage might indicate poor quality of the reads (low complexity), problems with the reference genome, or mislabeled lanes.
colors <- c(rep(rgb(1,0,0,alpha=0.7),2), rep(rgb(0,0,1,alpha=0.7),2), rep(rgb(0,1,0,alpha=0.7),2), rep(rgb(0,1,1,alpha=0.7),2)) barplot(bfs,las=2,col=colors)
The figure, produced using the
barplot method for the
BamFileList class, displays the number of mapped reads for
the subset of the yeast dataset included in the package
not provide unaligned reads, but barplots of the total number of reads
can be obtained using the
barplot method for the
FastqFileList class. Analogously, one can plot the percentage
of mapped reads with the
plot method with signature
c(x="BamFileList", y="FastqFileList"). See the manual pages for
As an additional quality check, one can plot the mean per-base (i.e., per-cycle) quality of the unmapped or mapped reads in every lane.
plotQuality(bfs,col=colors,lty=1) legend("topright",unique(elementMetadata(bfs)[,1]), fill=unique(colors))
If one is interested in looking more thoroughly at one lane, it is possible to display the per-base distribution of quality scores for each lane and the number of mapped reads stratified by chromosome or strand. As expected, all the reads are mapped to chromosome XIII.
A potential source of bias is related to the sequence composition of
the reads. The function
plotNtFrequency plots the
per-base nucleotide frequencies for all the reads in a given