Here we show a brief example of using the alpine package to model
bias parameters and then using those parameters to estimate transcript
abundance. We load a metadata table and a subset of reads from four
samples from the GEUVADIS project. For more details on these files,
see ?alpineData
in the alpineData package.
library(alpineData)
dir <- system.file("extdata",package="alpineData")
metadata <- read.csv(file.path(dir,"metadata.csv"),
stringsAsFactors=FALSE)
metadata[,c("Title","Performer","Date","Population")]
## Title Performer Date Population
## 1 ERR188297 UNIGE 111124 TSI
## 2 ERR188088 UNIGE 111124 TSI
## 3 ERR188204 CNAG_CRG 111215 TSI
## 4 ERR188317 CNAG_CRG 111215 TSI
A subset of the reads from one of the samples:
library(GenomicAlignments)
ERR188297()
## GAlignmentPairs object with 25531 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges --
## <Rle> <Rle> : <IRanges> --
## [1] 1 + : [108560389, 108560463] --
## [2] 1 - : [108560454, 108560528] --
## [3] 1 + : [108560534, 108600608] --
## [4] 1 - : [108569920, 108569994] --
## [5] 1 - : [108587954, 108588028] --
## ... ... ... ... ... ...
## [25527] X + : [119790596, 119790670] --
## [25528] X + : [119790988, 119791062] --
## [25529] X + : [119791037, 119791111] --
## [25530] X + : [119791348, 119791422] --
## [25531] X + : [119791376, 119791450] --
## ranges
## <IRanges>
## [1] [108560454, 108560528]
## [2] [108560383, 108560457]
## [3] [108600626, 108606236]
## [4] [108569825, 108569899]
## [5] [108587881, 108587955]
## ... ...
## [25527] [119790717, 119790791]
## [25528] [119791086, 119791160]
## [25529] [119791142, 119791216]
## [25530] [119791475, 119791549]
## [25531] [119791481, 119791555]
## -------
## seqinfo: 194 sequences from an unspecified genome
Before we start, we need to write these paired-end reads, here stored in a R/Bioconductor data object, out to a BAM file, because the alpine software works with alignments stored as BAM files. This is not a typical step, as you would normally have BAM files already on disk. We write out four BAM files for each of the four samples contained in alpineData. So you can ignore the following code chunk if you are working with your own BAM files.
library(rtracklayer)
dir <- system.file(package="alpineData", "extdata")
for (sample.name in metadata$Title) {
# the reads are accessed with functions named
# after the sample name. the following line calls
# the function with the sample name and saves
# the reads to `gap`
gap <- match.fun(sample.name)()
file.name <- file.path(dir,paste0(sample.name,".bam"))
export(gap, con=file.name)
}
bam.files <- file.path(dir, paste0(metadata$Title, ".bam"))
names(bam.files) <- metadata$Title
stopifnot(all(file.exists(bam.files)))
Now we continue with the typical steps in an alpine workflow. To fit the bias model, we need to identify single-isoform genes. We used the following chunk of code (here not evaluated) to generate a GRangesList of exons per single-isoform gene.
library(ensembldb)
gtf.file <- "Homo_sapiens.GRCh38.84.gtf"
txdb <- EnsDb(gtf.file) # already an EnsDb
txdf <- transcripts(txdb, return.type="DataFrame")
tab <- table(txdf$gene_id)
one.iso.genes <- names(tab)[tab == 1]
# pre-selected genes based on medium to high counts
# calculated using Rsubread::featureCounts
selected.genes <- scan("selected.genes.txt", what="char")
one.iso.txs <- txdf$tx_id[txdf$gene_id %in%
intersect(one.iso.genes, selected.genes)]
ebt0 <- exonsBy(txdb, by="tx")
ebt.fit <- ebt0[one.iso.txs]
Here we pick a subset of single-isoform genes based on the number of exons, and the length. We show in comments the recommended parameters to use in selecting this subset of genes, although here we use different parameters to ensure the building of the vignette takes only a short period of time and does not use much memory.
library(GenomicRanges)
library(alpine)
data(preprocessedData)
# filter small genes and long genes
min.bp <- 600
max.bp <- 7000
gene.lengths <- sum(width(ebt.fit))
summary(gene.lengths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66.0 689.8 1585.0 2147.0 3352.0 6103.0
ebt.fit <- ebt.fit[gene.lengths > min.bp & gene.lengths < max.bp]
length(ebt.fit)
## [1] 25
set.seed(1)
# better to use ~100 genes
ebt.fit <- ebt.fit[sample(length(ebt.fit),10)]
Robust fitting of these bias parameters is best with ~100 medium to
high count genes, e.g. mean count across samples between 200 and
10,000. These counts can be identified by featureCounts from the
Rsubread Bioconductor package, for example.
It is required to specify a minimum and maximum fragment size
which should be lower and upper quantiles of the fragment length
distribution. The minsize
and maxsize
arguments are recommended to be roughly the 2.5% and 97.5% of the
fragment length distribution. This can be quickly estimated using the
helper function getFragmentWidths, iterating over a few
single-isoform genes with sufficient counts:
w <- getFragmentWidths(bam.files[1], ebt.fit[[1]])
c(summary(w), Number=length(w))
## Min. 1st Qu. Median Mean 3rd Qu. Max. Number
## 79.0 161.0 189.0 197.1 224.0 468.0 1762.0
quantile(w, c(.025, .975))
## 2.5% 97.5%
## 122.05 321.95
It is also required to specify the read length. Currently alpine only supports unstranded, paired-end RNA-seq with fixed read length. Differences of +/- 1 bp in read length across samples can be ignored.
getReadLength(bam.files)
## ERR188297 ERR188088 ERR188204 ERR188317
## 75 75 76 76
Here we use a very limited range of fragment lengths for speed, but for a real analysis we would suggest using the minimum and maximum of the quantiles computed above across all samples (the minimum of the lower quantiles and the maximum of the upper quantiles).
library(alpine)
library(BSgenome.Hsapiens.NCBI.GRCh38)
readlength <- 75
minsize <- 125 # better 80 for this data
maxsize <- 175 # better 350 for this data
gene.names <- names(ebt.fit)
names(gene.names) <- gene.names
The following function builds a list of DataFrames which store information about the fragment types from each gene in our training set.
system.time({
fragtypes <- lapply(gene.names, function(gene.name) {
buildFragtypes(exons=ebt.fit[[gene.name]],
genome=Hsapiens,
readlength=readlength,
minsize=minsize,
maxsize=maxsize,
gc.str=FALSE)
})
})
## user system elapsed
## 16.180 0.368 17.386
print(object.size(fragtypes), units="auto")
## 99.7 Mb
We can examine the information for a single gene:
head(fragtypes[[1]], 3)
## DataFrame with 3 rows and 14 columns
## start end relpos fraglen id fivep.test
## <integer> <integer> <numeric> <integer> <IRanges> <logical>
## 1 1 125 0.01032279 125 [1, 125] FALSE
## 2 1 126 0.01032279 126 [1, 126] FALSE
## 3 1 127 0.01048665 127 [1, 127] FALSE
## fivep threep.test threep gc gstart
## <DNAStringSet> <logical> <DNAStringSet> <numeric> <integer>
## 1 ATCCGGGGCAGCG TRUE TCAATTCAAAATGTCTCAGGA 0.7680000 32277664
## 2 ATCCGGGGCAGCG TRUE GTCAATTCAAAATGTCTCAGG 0.7619048 32277664
## 3 ATCCGGGGCAGCG TRUE TGTCAATTCAAAATGTCTCAG 0.7637795 32277664
## gend gread1end gread2start
## <integer> <integer> <integer>
## 1 32309735 32277738 32277714
## 2 32309736 32277738 32277715
## 3 32309737 32277738 32277716
The definition of bias models is extremely flexible in alpine. The
models
argument should be given as a list, where each element is
model. The model itself should be provided as a list with elements
formula
and offset
. Either formula
or offset
can be set to
NULL
for a given model.
The allowable offsets are fraglen
and/or vlmm
which should be
provided in a character vector.
Offsets are only estimated once for all models, so setting
formula=NULL
only makes sense if extra offsets are desired
which were not already calculated by other models.
Any kind of R formula can be provided to formula
, making use of the
fragment features:
gc
(fragment GC content from 0 to 1)relpos
(fragment midpoint relative position from 0 to 1)GC40.80
, GC40.90
, GC20.80
, GC20.90
(indicator variables
indicating the presence of, e.g. a 40 bp stretch of 80% or higher GC
content within the fragment)These fragment features reference columns of information stored in
fragtypes
. Interactions between these terms and offsets are also
possible, e.g. gc:fraglen
.
Note: It is required to provide formula as character strings, which are converted internally into formula, due to details in how R formula make copies of objects from the environment.
models <- list(
"GC" = list(
formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
offset=c("fraglen")
),
"all" = list(
formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
offset=c("fraglen","vlmm")
)
)
Here we fit one bias model, GC
, using fragment length, fragment GC
content, relative position, and a term for differences in expression
across the genes (+ gene
).
We fit another bias model, all
, with all the terms of the first but
additionally with read start bias (encoded by a Variable Length Markov
Model, or VLMM).
Note: It is required if a formula is provided that it end with +
gene
to account for differences in base expression levels while
fitting the bias parameters.
The knots and boundary knots for GC content (gc
) and relative
position (relpos
) splines have reasonable default values, but they
can be customized using arguments to the fitBiasModels function.
The returned object, fitpar
, stores the information as a list of
fitted parameters across samples.
system.time({
fitpar <- lapply(bam.files, function(bf) {
fitBiasModels(genes=ebt.fit,
bam.file=bf,
fragtypes=fragtypes,
genome=Hsapiens,
models=models,
readlength=readlength,
minsize=minsize,
maxsize=maxsize)
})
})
## user system elapsed
## 55.620 1.064 60.237
# this object saved was 'fitpar.small' for examples in alpine man pages
# fitpar.small <- fitpar
Note that with more basepairs between minsize
and maxsize
and with
more genes used for estimation, the bias parameters would be more
precise. As estimated here, the curves look a bit wobbly. Compare to
the curves that are fit in the alpine paper (see citation("alpine")
).
The estimated spline coefficients have high variance from too few
observations (paired-end fragments) across too few genes.
First we set a palette to distinguish between samples
library(RColorBrewer)
palette(brewer.pal(8,"Dark2"))
The fragment length distribution:
perf <- as.integer(factor(metadata$Performer))
plotFragLen(fitpar, col=perf)
The fragment GC bias curves:
plotGC(fitpar, model="all", col=perf)
The relative position curves:
plotRelPos(fitpar, model="all", col=perf)
A 0-order version of the VLMM (note that the VLMM that is used in the model includes positions that are 1- and 2-order, so this plot does not represent the final VLMM used in bias estimation or in estimation of abundances).
plotOrder0(fitpar[["ERR188297"]][["vlmm.fivep"]][["order0"]])
plotOrder0(fitpar[["ERR188297"]][["vlmm.threep"]][["order0"]])
A coefficient table for the terms in formula
:
print(head(fitpar[["ERR188297"]][["summary"]][["all"]]), row.names=FALSE)
## Estimate Std. Error z value Pr(>|z|)
## -4.9780792 1.2944129 -3.8458 1.20e-04
## 1.3428276 1.2486577 1.0754 2.82e-01
## 1.8331308 0.8646923 2.1200 3.40e-02
## 1.8867834 2.5408684 0.7426 4.58e-01
## -3.8256451 2.0065307 -1.9066 5.66e-02
## 0.8080533 0.1568161 5.1529 2.57e-07
We pick a subset of genes for estimating transcript abundances. If the gene annotation includes genes with transcripts which span multiple chromosomes or which do not have any overlap and are very far apart, splitGenesAcrossChroms and splitLongGenes, respectively, can be used to split these. For again merging any overlapping transcripts into “genes”, the mergeGenes function can be used. Here we use the ENSEMBL gene annotation as is.
The following code chunk is not evaluated but was used to select a few genes for demonstrating estimateAbundance:
one.iso.genes <- intersect(names(tab)[tab == 1], selected.genes)
two.iso.genes <- intersect(names(tab)[tab == 2], selected.genes)
three.iso.genes <- intersect(names(tab)[tab == 3], selected.genes)
set.seed(1)
genes.theta <- c(sample(one.iso.genes, 2),
sample(two.iso.genes, 2),
sample(three.iso.genes, 2))
txdf.theta <- txdf[txdf$gene_id %in% genes.theta,]
ebt.theta <- ebt0[txdf.theta$tx_id]
Next we specify the set of models we want to use, referring back by
name to the models that were fit in the previous step. Additionally,
we can include any of the following models: null
, fraglen
, vlmm
,
or fraglen.vlmm
which are the four models that can be fit using only
offsets (none, either or both of the offsets).
model.names <- c("null","fraglen.vlmm","GC")
Here we estimate FPKM-scale abundances for multiple genes and multiple
samples. If lib.sizes
is not specified, a default value of 1e6
is used. estimateAbundance works one gene at a time, where the
transcripts
argument expects a GRangesList of the exons for each
transcript (multiple if the gene has multiple isoforms).
system.time({
res <- lapply(genes.theta, function(gene.name) {
txs <- txdf.theta$tx_id[txdf.theta$gene_id == gene.name]
estimateAbundance(transcripts=ebt.theta[txs],
bam.files=bam.files,
fitpar=fitpar,
genome=Hsapiens,
model.names=model.names)
})
})
## user system elapsed
## 56.780 2.952 59.807
Each element of this list has the abundances (theta
) and average
bias (lambda
) for a single gene across all samples, all models, and all
isoforms of the gene:
res[[1]][["ERR188297"]][["GC"]]
## $theta
## ENST00000259030
## 0.3425053
##
## $lambda
## ENST00000259030
## 73.34301
res[[6]][["ERR188297"]][["GC"]]
## $theta
## ENST00000477403 ENST00000468844 ENST00000361575
## 7.869292934 0.008579681 3.014203873
##
## $lambda
## ENST00000477403 ENST00000468844 ENST00000361575
## 82.26703 70.47448 82.04643
The extractAlpine function can be used to collate estimates from
across all genes. extractAlpine will scale the estimates such that
the total bias observed over all transcripts is centered at 1. The
estimates produce by estimateAbundance presume a default library size of
1e6, but will be rescaled using the total number of fragments across
genes when using extractAlpine (if this library size rescaling is
not desired, choose divide.out=FALSE
).
mat <- extractAlpine(res, model="GC")
mat
## ERR188297 ERR188088 ERR188204 ERR188317
## ENST00000259030 5.557029e+03 17769.0617 10514.5581 1.302500e+04
## ENST00000304788 4.300769e+03 5620.1425 8585.7954 7.675728e+03
## ENST00000295025 1.341303e+04 9652.3313 18721.2112 1.531396e+04
## ENST00000394479 4.793712e+03 1000.5219 6205.6560 4.114905e+03
## ENST00000330871 1.525750e+04 33787.1234 10879.3160 1.139717e+04
## ENST00000587578 0.000000e+00 0.0000 0.0000 0.000000e+00
## ENST00000264254 3.832130e+04 50374.8018 53612.5766 6.396532e+04
## ENST00000416255 2.095944e+03 6057.3662 4859.2903 2.357083e+03
## ENST00000450127 5.805294e-32 1681.2494 585.0105 2.387932e-28
## ENST00000477403 1.276765e+05 128681.9577 243208.0857 1.707819e+05
## ENST00000468844 1.392023e+02 634.9873 1192.6475 1.167633e+03
## ENST00000361575 4.890440e+04 63053.9109 84102.2731 1.851644e+05
If we provide a GRangesList which contains the exons for each
transcript, the returned object will be a SummarizedExperiment.
The GRangesList provided to transcripts
does not have to be in the
correct order, the transcripts will be extracted by name to match the
rows of the FPKM matrix.
se <- extractAlpine(res, model="GC", transcripts=ebt.theta)
se
## class: RangedSummarizedExperiment
## dim: 12 4
## metadata(0):
## assays(1): FPKM
## rownames(12): ENST00000259030 ENST00000304788 ... ENST00000468844
## ENST00000361575
## rowData names(0):
## colnames(4): ERR188297 ERR188088 ERR188204 ERR188317
## colData names(0):
The matrix of FPKM values can be scaled using the median ratio method of DESeq with the normalizeDESeq function. This is a robust method which removes systematic differences in values across samples, and is more appropriate than using the total count which is sensitive to very large abundance estimates for a minority of transcripts.
norm.mat <- normalizeDESeq(mat, cutoff=0.1)
The fragment GC bias which alpine estimates can be used in
downstream simulations, for example in the
polyester Bioconductor
package. All we need to do is to run the plotGC function, but
specifying that instead of a plot, we want to return a matrix of
probabilities for each percentile of fragment GC content. This matrix
can be provided to the frag_GC_bias
argument of simulate_experiment.
We load a fitpar
object that was run with the fragment length range
[80,350] bp.
data(preprocessedData)
prob.mat <- plotGC(fitpar, "all", return.type=2)
head(prob.mat)
## ERR188297 ERR188088 ERR188204 ERR188317
## 0 0.04366855 0.2561645 0.06914584 0.07234787
## 0.01 0.04936226 0.2725952 0.07667866 0.08005226
## 0.02 0.05578805 0.2900464 0.08501986 0.08856473
## 0.03 0.06302707 0.3085437 0.09424127 0.09795502
## 0.04 0.07116603 0.3281071 0.10441771 0.10829555
## 0.05 0.08029675 0.3487502 0.11562636 0.11966079
If return.type=0
(the default) the function makes the plot of log
fragment rate over fragment GC content. If return.type=1
the
function returns the matrix of log fragment rate over percentiles of
fragment GC content, and if return.type=2
, the matrix returns
probabilities of observing fragments based on percentiles of fragment
GC content (the log fragment rate exponentiated and scaled to have a
maximum of 1). The matrix returned by return.type=2
is appropriate
for downstream use with polyester.
In the alpine paper, it was shown that models incorporating fragment GC bias can be a better predictor of test set RNA-seq fragment coverage, compared to models incorporating read start bias. Here we show how to predict fragment coverage for a single-isoform gene using a variety of fitted bias models. As with estimateAbundace, the model names need to refer back to models fit using fitBiasModels.
model.names <- c("fraglen","fraglen.vlmm","GC","all")
The following function computes the predicted coverage for one
single-isoform gene. We load a fitpar
object that was run
with the fragment length range [80,350] bp.
fitpar[[1]][["model.params"]][c("minsize","maxsize")]
## $minsize
## [1] 80
##
## $maxsize
## [1] 350
system.time({
pred.cov <- predictCoverage(gene=ebt.fit[["ENST00000245479"]],
bam.files=bam.files["ERR188204"],
fitpar=fitpar,
genome=Hsapiens,
model.names=model.names)
})
## user system elapsed
## 24.808 2.916 27.751
We can plot the observed and predicted coverage for one of the genes:
palette(brewer.pal(9, "Set1"))
frag.cov <- pred.cov[["ERR188204"]][["frag.cov"]]
plot(frag.cov, type="l", lwd=3, ylim=c(0,max(frag.cov)*1.5))
for (i in seq_along(model.names)) {
m <- model.names[i]
pred <- pred.cov[["ERR188204"]][["pred.cov"]][[m]]
lines(pred, col=i, lwd=3)
}
legend("topright", legend=c("observed",model.names),
col=c("black",seq_along(model.names)), lwd=3)
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2
## [2] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
## [3] BSgenome_1.42.0
## [4] alpine_1.0.0
## [5] rtracklayer_1.34.0
## [6] GenomicAlignments_1.10.0
## [7] Rsamtools_1.26.0
## [8] Biostrings_2.42.0
## [9] XVector_0.14.0
## [10] SummarizedExperiment_1.4.0
## [11] Biobase_2.34.0
## [12] alpineData_0.99.0
## [13] ExperimentHubData_1.0.0
## [14] AnnotationHubData_1.4.0
## [15] futile.logger_1.4.3
## [16] GenomicRanges_1.26.0
## [17] GenomeInfoDb_1.10.0
## [18] IRanges_2.8.0
## [19] S4Vectors_0.12.0
## [20] ExperimentHub_1.0.0
## [21] AnnotationHub_2.6.0
## [22] BiocGenerics_0.20.0
## [23] knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 lattice_0.20-34
## [3] digest_0.6.10 mime_0.5
## [5] R6_2.2.0 chron_2.3-47
## [7] futile.options_1.0.0 RSQLite_1.0.0
## [9] evaluate_0.10 httr_1.2.1
## [11] BiocInstaller_1.24.0 biocViews_1.42.0
## [13] zlibbioc_1.20.0 GenomicFeatures_1.26.0
## [15] curl_2.1 data.table_1.9.6
## [17] Matrix_1.2-7.1 RUnit_0.4.31
## [19] splines_3.3.1 BiocParallel_1.8.0
## [21] stringr_1.1.0 RCurl_1.95-4.8
## [23] biomaRt_2.30.0 shiny_0.14.1
## [25] httpuv_1.3.3 speedglm_0.3-1
## [27] htmltools_0.3.5 GEOquery_2.40.0
## [29] interactiveDisplayBase_1.12.0 codetools_0.2-15
## [31] BiocCheck_1.10.0 XML_3.98-1.4
## [33] AnnotationForge_1.16.0 MASS_7.3-45
## [35] bitops_1.0-6 grid_3.3.1
## [37] RBGL_1.50.0 jsonlite_1.1
## [39] xtable_1.8-2 DBI_0.5-1
## [41] magrittr_1.5 formatR_1.4
## [43] graph_1.52.0 stringi_1.1.2
## [45] getopt_1.20.0 optparse_1.3.2
## [47] xml2_1.0.0 rBiopaxParser_2.14.0
## [49] lambda.r_1.1.9 tools_3.3.1
## [51] OrganismDbi_1.16.0 AnnotationDbi_1.36.0