Modeling and correcting fragment sequence bias

Here we show a brief example of using the alpine package to model bias parameters and then using those parameters to estimate transcript abundance. We load a metadata table and a subset of reads from four samples from the GEUVADIS project. For more details on these files, see ?alpineData in the alpineData package.

library(alpineData)
dir <- system.file("extdata",package="alpineData")
metadata <- read.csv(file.path(dir,"metadata.csv"),
                     stringsAsFactors=FALSE)
metadata[,c("Title","Performer","Date","Population")]
##       Title Performer   Date Population
## 1 ERR188297     UNIGE 111124        TSI
## 2 ERR188088     UNIGE 111124        TSI
## 3 ERR188204  CNAG_CRG 111215        TSI
## 4 ERR188317  CNAG_CRG 111215        TSI

A subset of the reads from one of the samples:

library(GenomicAlignments)
ERR188297()
## GAlignmentPairs object with 25531 pairs, strandMode=1, and 0 metadata columns:
##           seqnames strand   :                 ranges  --
##              <Rle>  <Rle>   :              <IRanges>  --
##       [1]        1      +   : [108560389, 108560463]  --
##       [2]        1      -   : [108560454, 108560528]  --
##       [3]        1      +   : [108560534, 108600608]  --
##       [4]        1      -   : [108569920, 108569994]  --
##       [5]        1      -   : [108587954, 108588028]  --
##       ...      ...    ... ...                    ... ...
##   [25527]        X      +   : [119790596, 119790670]  --
##   [25528]        X      +   : [119790988, 119791062]  --
##   [25529]        X      +   : [119791037, 119791111]  --
##   [25530]        X      +   : [119791348, 119791422]  --
##   [25531]        X      +   : [119791376, 119791450]  --
##                           ranges
##                        <IRanges>
##       [1] [108560454, 108560528]
##       [2] [108560383, 108560457]
##       [3] [108600626, 108606236]
##       [4] [108569825, 108569899]
##       [5] [108587881, 108587955]
##       ...                    ...
##   [25527] [119790717, 119790791]
##   [25528] [119791086, 119791160]
##   [25529] [119791142, 119791216]
##   [25530] [119791475, 119791549]
##   [25531] [119791481, 119791555]
##   -------
##   seqinfo: 194 sequences from an unspecified genome

Before we start, we need to write these paired-end reads, here stored in a R/Bioconductor data object, out to a BAM file, because the alpine software works with alignments stored as BAM files. This is not a typical step, as you would normally have BAM files already on disk. We write out four BAM files for each of the four samples contained in alpineData. So you can ignore the following code chunk if you are working with your own BAM files.

library(rtracklayer)
dir <- system.file(package="alpineData", "extdata")
for (sample.name in metadata$Title) {
  # the reads are accessed with functions named
  # after the sample name. the following line calls
  # the function with the sample name and saves 
  # the reads to `gap`
  gap <- match.fun(sample.name)()
  file.name <- file.path(dir,paste0(sample.name,".bam"))
  export(gap, con=file.name)
}
bam.files <- file.path(dir, paste0(metadata$Title, ".bam"))
names(bam.files) <- metadata$Title
stopifnot(all(file.exists(bam.files)))

Now we continue with the typical steps in an alpine workflow. To fit the bias model, we need to identify single-isoform genes. We used the following chunk of code (here not evaluated) to generate a GRangesList of exons per single-isoform gene.

library(ensembldb)
gtf.file <- "Homo_sapiens.GRCh38.84.gtf"
txdb <- EnsDb(gtf.file) # already an EnsDb
txdf <- transcripts(txdb, return.type="DataFrame")
tab <- table(txdf$gene_id)
one.iso.genes <- names(tab)[tab == 1]
# pre-selected genes based on medium to high counts
# calculated using Rsubread::featureCounts
selected.genes <- scan("selected.genes.txt", what="char")
one.iso.txs <- txdf$tx_id[txdf$gene_id %in%
                          intersect(one.iso.genes, selected.genes)]
ebt0 <- exonsBy(txdb, by="tx")
ebt.fit <- ebt0[one.iso.txs]

Here we pick a subset of single-isoform genes based on the number of exons, and the length. We show in comments the recommended parameters to use in selecting this subset of genes, although here we use different parameters to ensure the building of the vignette takes only a short period of time and does not use much memory.

library(GenomicRanges)
library(alpine)
data(preprocessedData)
# filter small genes and long genes
min.bp <- 600
max.bp <- 7000 
gene.lengths <- sum(width(ebt.fit))
summary(gene.lengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    66.0   689.8  1585.0  2147.0  3352.0  6103.0
ebt.fit <- ebt.fit[gene.lengths > min.bp & gene.lengths < max.bp]
length(ebt.fit)
## [1] 25
set.seed(1)
# better to use ~100 genes
ebt.fit <- ebt.fit[sample(length(ebt.fit),10)] 

Defining a set of fragment types

Robust fitting of these bias parameters is best with ~100 medium to high count genes, e.g. mean count across samples between 200 and 10,000. These counts can be identified by featureCounts from the Rsubread Bioconductor package, for example. It is required to specify a minimum and maximum fragment size which should be lower and upper quantiles of the fragment length distribution. The minsize and maxsize arguments are recommended to be roughly the 2.5% and 97.5% of the fragment length distribution. This can be quickly estimated using the helper function getFragmentWidths, iterating over a few single-isoform genes with sufficient counts:

w <- getFragmentWidths(bam.files[1], ebt.fit[[1]])
c(summary(w), Number=length(w))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Number 
##    79.0   161.0   189.0   197.1   224.0   468.0  1762.0
quantile(w, c(.025, .975))
##   2.5%  97.5% 
## 122.05 321.95

It is also required to specify the read length. Currently alpine only supports unstranded, paired-end RNA-seq with fixed read length. Differences of +/- 1 bp in read length across samples can be ignored.

getReadLength(bam.files)
## ERR188297 ERR188088 ERR188204 ERR188317 
##        75        75        76        76

Here we use a very limited range of fragment lengths for speed, but for a real analysis we would suggest using the minimum and maximum of the quantiles computed above across all samples (the minimum of the lower quantiles and the maximum of the upper quantiles).

library(alpine)
library(BSgenome.Hsapiens.NCBI.GRCh38)
readlength <- 75 
minsize <- 125 # better 80 for this data
maxsize <- 175 # better 350 for this data
gene.names <- names(ebt.fit)
names(gene.names) <- gene.names

The following function builds a list of DataFrames which store information about the fragment types from each gene in our training set.

system.time({
fragtypes <- lapply(gene.names, function(gene.name) {
                      buildFragtypes(exons=ebt.fit[[gene.name]],
                                     genome=Hsapiens,
                                     readlength=readlength,
                                     minsize=minsize,
                                     maxsize=maxsize,
                                     gc.str=FALSE)
                    })
})
##    user  system elapsed 
##  16.180   0.368  17.386
print(object.size(fragtypes), units="auto")
## 99.7 Mb

We can examine the information for a single gene:

head(fragtypes[[1]], 3)
## DataFrame with 3 rows and 14 columns
##       start       end     relpos   fraglen        id fivep.test
##   <integer> <integer>  <numeric> <integer> <IRanges>  <logical>
## 1         1       125 0.01032279       125  [1, 125]      FALSE
## 2         1       126 0.01032279       126  [1, 126]      FALSE
## 3         1       127 0.01048665       127  [1, 127]      FALSE
##            fivep threep.test                threep        gc    gstart
##   <DNAStringSet>   <logical>        <DNAStringSet> <numeric> <integer>
## 1  ATCCGGGGCAGCG        TRUE TCAATTCAAAATGTCTCAGGA 0.7680000  32277664
## 2  ATCCGGGGCAGCG        TRUE GTCAATTCAAAATGTCTCAGG 0.7619048  32277664
## 3  ATCCGGGGCAGCG        TRUE TGTCAATTCAAAATGTCTCAG 0.7637795  32277664
##        gend gread1end gread2start
##   <integer> <integer>   <integer>
## 1  32309735  32277738    32277714
## 2  32309736  32277738    32277715
## 3  32309737  32277738    32277716

Defining and fitting bias models

The definition of bias models is extremely flexible in alpine. The models argument should be given as a list, where each element is model. The model itself should be provided as a list with elements formula and offset. Either formula or offset can be set to NULL for a given model. The allowable offsets are fraglen and/or vlmm which should be provided in a character vector. Offsets are only estimated once for all models, so setting formula=NULL only makes sense if extra offsets are desired which were not already calculated by other models.

Any kind of R formula can be provided to formula, making use of the fragment features:

These fragment features reference columns of information stored in fragtypes. Interactions between these terms and offsets are also possible, e.g. gc:fraglen.

Note: It is required to provide formula as character strings, which are converted internally into formula, due to details in how R formula make copies of objects from the environment.

models <- list(
  "GC" = list(
    formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
    offset=c("fraglen")
  ),
  "all" = list(
    formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
    offset=c("fraglen","vlmm")
  )
)

Here we fit one bias model, GC, using fragment length, fragment GC content, relative position, and a term for differences in expression across the genes (+ gene).

We fit another bias model, all, with all the terms of the first but additionally with read start bias (encoded by a Variable Length Markov Model, or VLMM).

Note: It is required if a formula is provided that it end with + gene to account for differences in base expression levels while fitting the bias parameters.

The knots and boundary knots for GC content (gc) and relative position (relpos) splines have reasonable default values, but they can be customized using arguments to the fitBiasModels function.

The returned object, fitpar, stores the information as a list of fitted parameters across samples.

system.time({
fitpar <- lapply(bam.files, function(bf) {
                   fitBiasModels(genes=ebt.fit,
                                 bam.file=bf,
                                 fragtypes=fragtypes,
                                 genome=Hsapiens,
                                 models=models,
                                 readlength=readlength,
                                 minsize=minsize,
                                 maxsize=maxsize)
                 })
})
##    user  system elapsed 
##  55.620   1.064  60.237
# this object saved was 'fitpar.small' for examples in alpine man pages
# fitpar.small <- fitpar 

Visually exploring the bias parameters

Note that with more basepairs between minsize and maxsize and with more genes used for estimation, the bias parameters would be more precise. As estimated here, the curves look a bit wobbly. Compare to the curves that are fit in the alpine paper (see citation("alpine")). The estimated spline coefficients have high variance from too few observations (paired-end fragments) across too few genes.

First we set a palette to distinguish between samples

library(RColorBrewer)
palette(brewer.pal(8,"Dark2"))

The fragment length distribution:

perf <- as.integer(factor(metadata$Performer))
plotFragLen(fitpar, col=perf)

plot of chunk fraglen

The fragment GC bias curves:

plotGC(fitpar, model="all", col=perf)

plot of chunk gccurve

The relative position curves:

plotRelPos(fitpar, model="all", col=perf)

plot of chunk relpos

A 0-order version of the VLMM (note that the VLMM that is used in the model includes positions that are 1- and 2-order, so this plot does not represent the final VLMM used in bias estimation or in estimation of abundances).

plotOrder0(fitpar[["ERR188297"]][["vlmm.fivep"]][["order0"]])

plot of chunk vlmm

plotOrder0(fitpar[["ERR188297"]][["vlmm.threep"]][["order0"]])

plot of chunk vlmm

A coefficient table for the terms in formula:

print(head(fitpar[["ERR188297"]][["summary"]][["all"]]), row.names=FALSE)
##    Estimate Std. Error z value Pr(>|z|)
##  -4.9780792  1.2944129 -3.8458 1.20e-04
##   1.3428276  1.2486577  1.0754 2.82e-01
##   1.8331308  0.8646923  2.1200 3.40e-02
##   1.8867834  2.5408684  0.7426 4.58e-01
##  -3.8256451  2.0065307 -1.9066 5.66e-02
##   0.8080533  0.1568161  5.1529 2.57e-07

Estimating transcript abundances

We pick a subset of genes for estimating transcript abundances. If the gene annotation includes genes with transcripts which span multiple chromosomes or which do not have any overlap and are very far apart, splitGenesAcrossChroms and splitLongGenes, respectively, can be used to split these. For again merging any overlapping transcripts into “genes”, the mergeGenes function can be used. Here we use the ENSEMBL gene annotation as is.

The following code chunk is not evaluated but was used to select a few genes for demonstrating estimateAbundance:

one.iso.genes <- intersect(names(tab)[tab == 1], selected.genes)
two.iso.genes <- intersect(names(tab)[tab == 2], selected.genes)
three.iso.genes <- intersect(names(tab)[tab == 3], selected.genes)
set.seed(1)
genes.theta <- c(sample(one.iso.genes, 2),
                 sample(two.iso.genes, 2),
                 sample(three.iso.genes, 2)) 
txdf.theta <- txdf[txdf$gene_id %in% genes.theta,]
ebt.theta <- ebt0[txdf.theta$tx_id]

Next we specify the set of models we want to use, referring back by name to the models that were fit in the previous step. Additionally, we can include any of the following models: null, fraglen, vlmm, or fraglen.vlmm which are the four models that can be fit using only offsets (none, either or both of the offsets).

model.names <- c("null","fraglen.vlmm","GC")

Here we estimate FPKM-scale abundances for multiple genes and multiple samples. If lib.sizes is not specified, a default value of 1e6 is used. estimateAbundance works one gene at a time, where the transcripts argument expects a GRangesList of the exons for each transcript (multiple if the gene has multiple isoforms).

system.time({
res <- lapply(genes.theta, function(gene.name) {
         txs <- txdf.theta$tx_id[txdf.theta$gene_id == gene.name]
         estimateAbundance(transcripts=ebt.theta[txs],
                           bam.files=bam.files,
                           fitpar=fitpar,
                           genome=Hsapiens,
                           model.names=model.names)
       })
})
##    user  system elapsed 
##  56.780   2.952  59.807

Each element of this list has the abundances (theta) and average bias (lambda) for a single gene across all samples, all models, and all isoforms of the gene:

res[[1]][["ERR188297"]][["GC"]]
## $theta
## ENST00000259030 
##       0.3425053 
## 
## $lambda
## ENST00000259030 
##        73.34301
res[[6]][["ERR188297"]][["GC"]]
## $theta
## ENST00000477403 ENST00000468844 ENST00000361575 
##     7.869292934     0.008579681     3.014203873 
## 
## $lambda
## ENST00000477403 ENST00000468844 ENST00000361575 
##        82.26703        70.47448        82.04643

The extractAlpine function can be used to collate estimates from across all genes. extractAlpine will scale the estimates such that the total bias observed over all transcripts is centered at 1. The estimates produce by estimateAbundance presume a default library size of 1e6, but will be rescaled using the total number of fragments across genes when using extractAlpine (if this library size rescaling is not desired, choose divide.out=FALSE).

mat <- extractAlpine(res, model="GC")
mat
##                    ERR188297   ERR188088   ERR188204    ERR188317
## ENST00000259030 5.557029e+03  17769.0617  10514.5581 1.302500e+04
## ENST00000304788 4.300769e+03   5620.1425   8585.7954 7.675728e+03
## ENST00000295025 1.341303e+04   9652.3313  18721.2112 1.531396e+04
## ENST00000394479 4.793712e+03   1000.5219   6205.6560 4.114905e+03
## ENST00000330871 1.525750e+04  33787.1234  10879.3160 1.139717e+04
## ENST00000587578 0.000000e+00      0.0000      0.0000 0.000000e+00
## ENST00000264254 3.832130e+04  50374.8018  53612.5766 6.396532e+04
## ENST00000416255 2.095944e+03   6057.3662   4859.2903 2.357083e+03
## ENST00000450127 5.805294e-32   1681.2494    585.0105 2.387932e-28
## ENST00000477403 1.276765e+05 128681.9577 243208.0857 1.707819e+05
## ENST00000468844 1.392023e+02    634.9873   1192.6475 1.167633e+03
## ENST00000361575 4.890440e+04  63053.9109  84102.2731 1.851644e+05

If we provide a GRangesList which contains the exons for each transcript, the returned object will be a SummarizedExperiment. The GRangesList provided to transcripts does not have to be in the correct order, the transcripts will be extracted by name to match the rows of the FPKM matrix.

se <- extractAlpine(res, model="GC", transcripts=ebt.theta)
se
## class: RangedSummarizedExperiment 
## dim: 12 4 
## metadata(0):
## assays(1): FPKM
## rownames(12): ENST00000259030 ENST00000304788 ... ENST00000468844
##   ENST00000361575
## rowData names(0):
## colnames(4): ERR188297 ERR188088 ERR188204 ERR188317
## colData names(0):

The matrix of FPKM values can be scaled using the median ratio method of DESeq with the normalizeDESeq function. This is a robust method which removes systematic differences in values across samples, and is more appropriate than using the total count which is sensitive to very large abundance estimates for a minority of transcripts.

norm.mat <- normalizeDESeq(mat, cutoff=0.1)

Simulating RNA-seq data with empirical GC bias

The fragment GC bias which alpine estimates can be used in downstream simulations, for example in the polyester Bioconductor package. All we need to do is to run the plotGC function, but specifying that instead of a plot, we want to return a matrix of probabilities for each percentile of fragment GC content. This matrix can be provided to the frag_GC_bias argument of simulate_experiment.

We load a fitpar object that was run with the fragment length range [80,350] bp.

data(preprocessedData)
prob.mat <- plotGC(fitpar, "all", return.type=2)
head(prob.mat)
##       ERR188297 ERR188088  ERR188204  ERR188317
## 0    0.04366855 0.2561645 0.06914584 0.07234787
## 0.01 0.04936226 0.2725952 0.07667866 0.08005226
## 0.02 0.05578805 0.2900464 0.08501986 0.08856473
## 0.03 0.06302707 0.3085437 0.09424127 0.09795502
## 0.04 0.07116603 0.3281071 0.10441771 0.10829555
## 0.05 0.08029675 0.3487502 0.11562636 0.11966079

If return.type=0 (the default) the function makes the plot of log fragment rate over fragment GC content. If return.type=1 the function returns the matrix of log fragment rate over percentiles of fragment GC content, and if return.type=2, the matrix returns probabilities of observing fragments based on percentiles of fragment GC content (the log fragment rate exponentiated and scaled to have a maximum of 1). The matrix returned by return.type=2 is appropriate for downstream use with polyester.

Plotting predicted fragment coverage

In the alpine paper, it was shown that models incorporating fragment GC bias can be a better predictor of test set RNA-seq fragment coverage, compared to models incorporating read start bias. Here we show how to predict fragment coverage for a single-isoform gene using a variety of fitted bias models. As with estimateAbundace, the model names need to refer back to models fit using fitBiasModels.

model.names <- c("fraglen","fraglen.vlmm","GC","all")

The following function computes the predicted coverage for one single-isoform gene. We load a fitpar object that was run with the fragment length range [80,350] bp.

fitpar[[1]][["model.params"]][c("minsize","maxsize")]
## $minsize
## [1] 80
## 
## $maxsize
## [1] 350
system.time({
  pred.cov <- predictCoverage(gene=ebt.fit[["ENST00000245479"]],
                              bam.files=bam.files["ERR188204"],
                              fitpar=fitpar,
                              genome=Hsapiens,
                              model.names=model.names)
})
##    user  system elapsed 
##  24.808   2.916  27.751

We can plot the observed and predicted coverage for one of the genes:

palette(brewer.pal(9, "Set1"))
frag.cov <- pred.cov[["ERR188204"]][["frag.cov"]]
plot(frag.cov, type="l", lwd=3, ylim=c(0,max(frag.cov)*1.5))
for (i in seq_along(model.names)) {
  m <- model.names[i]
  pred <- pred.cov[["ERR188204"]][["pred.cov"]][[m]]
  lines(pred, col=i, lwd=3)
}
legend("topright", legend=c("observed",model.names),
       col=c("black",seq_along(model.names)), lwd=3)

plot of chunk unnamed-chunk-24

Session information

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2                    
##  [2] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
##  [3] BSgenome_1.42.0                       
##  [4] alpine_1.0.0                          
##  [5] rtracklayer_1.34.0                    
##  [6] GenomicAlignments_1.10.0              
##  [7] Rsamtools_1.26.0                      
##  [8] Biostrings_2.42.0                     
##  [9] XVector_0.14.0                        
## [10] SummarizedExperiment_1.4.0            
## [11] Biobase_2.34.0                        
## [12] alpineData_0.99.0                     
## [13] ExperimentHubData_1.0.0               
## [14] AnnotationHubData_1.4.0               
## [15] futile.logger_1.4.3                   
## [16] GenomicRanges_1.26.0                  
## [17] GenomeInfoDb_1.10.0                   
## [18] IRanges_2.8.0                         
## [19] S4Vectors_0.12.0                      
## [20] ExperimentHub_1.0.0                   
## [21] AnnotationHub_2.6.0                   
## [22] BiocGenerics_0.20.0                   
## [23] knitr_1.14                            
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7                   lattice_0.20-34              
##  [3] digest_0.6.10                 mime_0.5                     
##  [5] R6_2.2.0                      chron_2.3-47                 
##  [7] futile.options_1.0.0          RSQLite_1.0.0                
##  [9] evaluate_0.10                 httr_1.2.1                   
## [11] BiocInstaller_1.24.0          biocViews_1.42.0             
## [13] zlibbioc_1.20.0               GenomicFeatures_1.26.0       
## [15] curl_2.1                      data.table_1.9.6             
## [17] Matrix_1.2-7.1                RUnit_0.4.31                 
## [19] splines_3.3.1                 BiocParallel_1.8.0           
## [21] stringr_1.1.0                 RCurl_1.95-4.8               
## [23] biomaRt_2.30.0                shiny_0.14.1                 
## [25] httpuv_1.3.3                  speedglm_0.3-1               
## [27] htmltools_0.3.5               GEOquery_2.40.0              
## [29] interactiveDisplayBase_1.12.0 codetools_0.2-15             
## [31] BiocCheck_1.10.0              XML_3.98-1.4                 
## [33] AnnotationForge_1.16.0        MASS_7.3-45                  
## [35] bitops_1.0-6                  grid_3.3.1                   
## [37] RBGL_1.50.0                   jsonlite_1.1                 
## [39] xtable_1.8-2                  DBI_0.5-1                    
## [41] magrittr_1.5                  formatR_1.4                  
## [43] graph_1.52.0                  stringi_1.1.2                
## [45] getopt_1.20.0                 optparse_1.3.2               
## [47] xml2_1.0.0                    rBiopaxParser_2.14.0         
## [49] lambda.r_1.1.9                tools_3.3.1                  
## [51] OrganismDbi_1.16.0            AnnotationDbi_1.36.0