Introduction to derfinder

If you wish, you can view this vignette online here.

Overview

In this vignette we present derfinder (Collado-Torres, Frazee, Jaffe, and Leek, 2014) which enables annotation-agnostic differential expression analysis at base-pair resolution. A example with publicly available data is showcased in this vignette. With it, the two main type of analyses that can be performed with derfinder are illustrated.

Example

As an example, we will analyze a small subset of the samples from the BrainSpan Atlas of the Human Brain (BrainSpan, 2011) publicly available data.

We first load the required packages.

## Load libraries
library('derfinder')
library('derfinderData')
library('GenomicRanges')

Phenotype data

For this example, we created a small table with the relevant phenotype data for 12 samples: 6 from fetal samples and 6 from adult samples. We chose at random a brain region, in this case the amygdaloid complex. For this example we will only look at data from chromosome 21. Other variables include the age in years and the gender. The data is shown below.

library('knitr')
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')

## Display the main information
p <- pheno[, -which(colnames(pheno) %in% c('structure_acronym', 'structure_name', 'file'))]
rownames(p) <- NULL
kable(p, format = 'html', row.names = TRUE)
gender lab Age group
1 F HSB97.AMY -0.5476190 fetal
2 M HSB92.AMY -0.4523810 fetal
3 M HSB178.AMY -0.5714286 fetal
4 M HSB159.AMY -0.3809524 fetal
5 F HSB153.AMY -0.6666667 fetal
6 F HSB113.AMY -0.6666667 fetal
7 F HSB130.AMY 21.0000000 adult
8 M HSB136.AMY 23.0000000 adult
9 F HSB126.AMY 30.0000000 adult
10 M HSB145.AMY 36.0000000 adult
11 M HSB123.AMY 37.0000000 adult
12 F HSB135.AMY 40.0000000 adult

Load the data

derfinder offers three functions related to loading data. The first one, rawFiles(), is a helper function for identifying the full paths to the input files. Next, loadCoverage() loads the base-level coverage data from either BAM or BigWig files for a specific chromosome. Finally, fullCoverage() will load the coverage for a set of chromosomes using loadCoverage().

We can load the data from derfinderData (Collado-Torres, Jaffe, and Leek, 2014) by first identifying the paths to the BigWig files with rawFiles() and then loading the data with fullCoverage().

## Determine the files to use and fix the names
files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'), samplepatt = 'bw', fileterm = NULL)
names(files) <- gsub('.bw', '', names(files))

## Load the data from disk
system.time(fullCov <- fullCoverage(files = files, chrs = 'chr21'))
## 2014-11-22 21:38:50 fullCoverage: processing chromosome chr21
## 2014-11-22 21:38:50 loadCoverage: finding chromosome lengths
## 2014-11-22 21:38:50 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB113.bw
## 2014-11-22 21:38:50 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB123.bw
## 2014-11-22 21:38:51 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB126.bw
## 2014-11-22 21:38:51 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB130.bw
## 2014-11-22 21:38:51 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB135.bw
## 2014-11-22 21:38:52 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB136.bw
## 2014-11-22 21:38:52 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB145.bw
## 2014-11-22 21:38:53 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB153.bw
## 2014-11-22 21:38:53 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB159.bw
## 2014-11-22 21:38:53 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB178.bw
## 2014-11-22 21:38:53 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB92.bw
## 2014-11-22 21:38:54 loadCoverage: loading BigWig file /home/biocbuild/bbs-3.0-bioc/R/library/derfinderData/extdata/AMY/HSB97.bw
## 2014-11-22 21:38:54 loadCoverage: applying the cutoff to the merged data
## 2014-11-22 21:38:54 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
##    user  system elapsed 
##   4.153   0.020   4.361

Alternatively, since the BigWig files are publicly available from BrainSpan (see here), we can extract the relevant coverage data using fullCoverage(). Note that as of rtracklayer 1.25.16 BigWig files are not supported on Windows: you can find the fullCov object inside derfinderData to follow the examples.

## Determine the files to use and fix the names
files <- pheno$file
names(files) <- gsub('.AMY', '', pheno$lab)

## Load the data from the web
system.time(fullCov <- fullCoverage(files = files, chrs = 'chr21'))

Note how loading the coverage for 12 samples from the web was quite fast. Although in this case we only retained the information for chromosome 21.

The result of fullCov is a list with one element per chromosome. If no filtering was performed, each chromosome has a DataFrame with the number of rows equaling the number of bases in the chromosome with one column per sample.

## Lets explore it
fullCov
## $chr21
## DataFrame with 48129895 rows and 12 columns
##          HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159
##                           
## 1             0      0      0      0      0      0      0      0      0
## 2             0      0      0      0      0      0      0      0      0
## 3             0      0      0      0      0      0      0      0      0
## 4             0      0      0      0      0      0      0      0      0
## 5             0      0      0      0      0      0      0      0      0
## ...         ...    ...    ...    ...    ...    ...    ...    ...    ...
## 48129891      0      0      0      0      0      0      0      0      0
## 48129892      0      0      0      0      0      0      0      0      0
## 48129893      0      0      0      0      0      0      0      0      0
## 48129894      0      0      0      0      0      0      0      0      0
## 48129895      0      0      0      0      0      0      0      0      0
##          HSB178 HSB92 HSB97
##             
## 1             0     0     0
## 2             0     0     0
## 3             0     0     0
## 4             0     0     0
## 5             0     0     0
## ...         ...   ...   ...
## 48129891      0     0     0
## 48129892      0     0     0
## 48129893      0     0     0
## 48129894      0     0     0
## 48129895      0     0     0

If filtering was performed, each chromosome also has a logical Rle indicating which bases of the chromosome passed the filtering. This information is useful later on to map back the results to the genome coordinates.

Filter coverage

Depending on the use case, you might want to filter the base level coverage at the time of reading it, or you might want to keep an unfiltered version. By default both loadCoverage() and fullCoverage() will not filter.

If you decide to filter, set the cutoff argument to a positive value. This will run filterData(). Note that you might want to standardize the library sizes prior to filtering, which can be done by supplying the totalMapped and targetSize arguments.

In this example, we prefer to keep both an unfiltered and filtered version. For the filtered version, we will retain the bases where at least one sample has coverage greater than 2.

## Filter coverage
filteredCov <- lapply(fullCov, filterData, cutoff = 2)
## 2014-11-22 21:38:56 filterData: originally there were 48129895 rows, now there are 130356 rows. Meaning that 99.73 percent was filtered.

The result is similar to fullCov but with the genomic position index as shown below.

## Similar to fullCov but with $position
filteredCov
## $chr21
## $chr21$coverage
## DataFrame with 130356 rows and 12 columns
##        HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159
##                         
## 1        2.01      0   0.00   0.04   0.23   0.13      0   0.59   0.15
## 2        2.17      0   0.00   0.04   0.23   0.13      0   0.63   0.15
## 3        2.21      0   0.00   0.04   0.23   0.13      0   0.67   0.15
## 4        2.37      0   0.00   0.04   0.23   0.13      0   0.71   0.15
## 5        2.37      0   0.06   0.04   0.23   0.13      0   0.75   0.25
## ...       ...    ...    ...    ...    ...    ...    ...    ...    ...
## 130352   1.25   1.28   2.05   0.79   1.63   1.37   1.03   2.21   2.46
## 130353   1.21   1.24   2.05   0.79   1.63   1.37   1.03   2.21   2.46
## 130354   1.21   1.20   1.93   0.79   1.63   1.28   0.73   2.01   2.21
## 130355   1.17   1.20   1.93   0.79   1.63   1.28   0.60   2.01   2.11
## 130356   1.17   1.09   1.87   0.79   1.55   1.02   0.56   1.93   1.96
##        HSB178 HSB92 HSB97
##           
## 1        0.05  0.07  1.58
## 2        0.05  0.07  1.58
## 3        0.05  0.10  1.61
## 4        0.05  0.13  1.65
## 5        0.10  0.13  1.68
## ...       ...   ...   ...
## 130352   2.07  2.23  1.51
## 130353   2.07  2.23  1.51
## 130354   2.11  2.13  1.47
## 130355   2.11  2.13  1.47
## 130356   1.78  2.06  1.47
## 
## $chr21$position
## logical-Rle of length 48129895 with 3235 runs
##   Lengths: 9825448     149       2       9 ...     740     598   45091
##   Values :   FALSE    TRUE   FALSE    TRUE ...   FALSE    TRUE   FALSE

In terms of memory, the filtered version requires less resources. Although this will depend on how rich the data set is and how aggressive was the filtering step.

## Compare the size in Mb
round(c(fullCov = object.size(fullCov), filteredCov = object.size(filteredCov)) / 1024^2, 1)
##     fullCov filteredCov 
##        22.7         8.5

Note that with your own data, filtering for bases where at least one sample has coverage greater than 2 might not make sense: maybe you need a higher or lower filter. The amount of bases remaining after filtering will impact how long the analysis will take to complete. We thus recommend exploring this before proceeding.

DER analysis

One form of base-level differential expression analysis implemented in derfinder is to calculate F-statistics for every base and use them to define candidate differentially expressed regions. This type of analysis is further explained in this section.

Models

Once we have the base-level coverage data for all 12 samples, we can construct the models. In this case, we want to find differences between fetal and adult samples while adjusting for gender and a proxy of the library size.

We can use sampleDepth() and it's helper function collapseFullCoverage() to get a proxy of the library size. Note that you would normally use the unfiltered data from all the chromosomes in this step and not just one.

## Get some idea of the library sizes
sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
## 2014-11-22 21:38:56 sampleDepth: Calculating sample quantiles
## 2014-11-22 21:39:03 sampleDepth: Calculating sample adjustments
sampleDepths
## HSB113.100% HSB123.100% HSB126.100% HSB130.100% HSB135.100% HSB136.100% 
##    19.82106    19.40505    19.53045    19.52017    20.33392    19.97758 
## HSB145.100% HSB153.100% HSB159.100% HSB178.100%  HSB92.100%  HSB97.100% 
##    19.49827    19.41285    19.24186    19.44252    19.55904    19.47733

sampleDepth() is similar to calcNormFactors() from metagenomeSeq with some code underneath tailored for the type of data we are using. collapseFullCoverage() is only needed to deal with the size of the data.

We can then define the nested models we want to use using makeModels(). This is a helper function that assumes that you will always adjust for the library size. You then need to define the variable to test, in this case we are comparing fetal vs adult samples. Optionally, you can adjust for other sample covariates, such as the gender in this case.

## Define models
models <- makeModels(sampleDepths, testvars = pheno$group, adjustvars = pheno[, c('gender')]) 

## Explore the models
lapply(models, head)
## $mod
##   (Intercept) testvarsadult sampleDepths adjustVar1M
## 1           1             0     19.82106           0
## 2           1             0     19.40505           1
## 3           1             0     19.53045           1
## 4           1             0     19.52017           1
## 5           1             0     20.33392           0
## 6           1             0     19.97758           0
## 
## $mod0
##   (Intercept) sampleDepths adjustVar1M
## 1           1     19.82106           0
## 2           1     19.40505           1
## 3           1     19.53045           1
## 4           1     19.52017           1
## 5           1     20.33392           0
## 6           1     19.97758           0

Note how the null model (mod0) is nested in the alternative model (mod). Use the same models for all your chromosomes unless you have a specific reason to use chromosome-specific models. Note that derfinder is very flexible and works with any type of nested model.

Find candidate DERs

Next, we can find candidate differentially expressed regions (DERs) using as input the segments of the genome where at least one sample has coverage greater than 2. That is, the filtered coverage version we created previously.

The main function in derfinder for this type of analysis is analyzeChr(). It works at a chromosome level and runs behinds the scenes several other derfinder functions. To use it, you have to provide the models, the grouping information, how to calculate the F-statistic cutoff and most importantly, the number of permutations.

By default analyzeChr() will use a theoretical cutoff. In this example, we use the cutoff that would correspond to a p-value of 0.05. To assign p-values to the candidate DERs, derfinder permutes the rows of the model matrices, re-calculates the F-statistics and identifies null regions. Then it compares the area of the observed regions versus the areas from the null regions to assign an empirical p-value.

In this example we will use twenty permutations, although in a real case scenario you might consider a larger number of permutations.

In real scenario, you might consider saving the results from all the chromosomes in a given directory. Here we will use analysisResults. For each chromosome you analyze, a new directory with the chromosome-specific data will be created. So in this case, we will have analysisResults/chr21.

## Create a analysis directory
dir.create('analysisResults')
originalWd <- getwd()
setwd(file.path(originalWd, 'analysisResults'))

## Perform differential expression analysis
system.time(results <- analyzeChr(chr = 'chr21', filteredCov$chr21, models, groupInfo = pheno$group, writeOutput = TRUE, cutoffFstat = 5e-02, nPermute = 20, seeds = 20140923 + seq_len(20), returnOutput = TRUE))
## 2014-11-22 21:39:03 analyzeChr: Pre-processing the coverage data
## 2014-11-22 21:39:09 analyzeChr: Calculating statistics
## 2014-11-22 21:39:09 calculateStats: calculating the F-statistics
## 2014-11-22 21:39:10 analyzeChr: Calculating pvalues
## 2014-11-22 21:39:10 analyzeChr: Using the following theoretical cutoff for the F-statistics 5.31765507157871
## 2014-11-22 21:39:10 calculatePvalues: identifying data segments
## 2014-11-22 21:39:10 findRegions: segmenting F-stats information
## 2014-11-22 21:39:10 findRegions: identifying candidate regions
## 2014-11-22 21:39:10 findRegions: identifying region clusters
## 2014-11-22 21:39:11 calculatePvalues: calculating F-statistics for permutation 1 and seed 20140924
## 2014-11-22 21:39:11 findRegions: segmenting F-stats information
## 2014-11-22 21:39:11 findRegions: identifying candidate regions
## 2014-11-22 21:39:11 calculatePvalues: calculating F-statistics for permutation 2 and seed 20140925
## 2014-11-22 21:39:11 findRegions: segmenting F-stats information
## 2014-11-22 21:39:11 findRegions: identifying candidate regions
## 2014-11-22 21:39:11 calculatePvalues: calculating F-statistics for permutation 3 and seed 20140926
## 2014-11-22 21:39:12 findRegions: segmenting F-stats information
## 2014-11-22 21:39:12 findRegions: identifying candidate regions
## 2014-11-22 21:39:12 calculatePvalues: calculating F-statistics for permutation 4 and seed 20140927
## 2014-11-22 21:39:12 findRegions: segmenting F-stats information
## 2014-11-22 21:39:12 findRegions: identifying candidate regions
## 2014-11-22 21:39:12 calculatePvalues: calculating F-statistics for permutation 5 and seed 20140928
## 2014-11-22 21:39:13 findRegions: segmenting F-stats information
## 2014-11-22 21:39:13 findRegions: identifying candidate regions
## 2014-11-22 21:39:13 calculatePvalues: calculating F-statistics for permutation 6 and seed 20140929
## 2014-11-22 21:39:13 findRegions: segmenting F-stats information
## 2014-11-22 21:39:13 findRegions: identifying candidate regions
## 2014-11-22 21:39:13 calculatePvalues: calculating F-statistics for permutation 7 and seed 20140930
## 2014-11-22 21:39:13 findRegions: segmenting F-stats information
## 2014-11-22 21:39:13 findRegions: identifying candidate regions
## 2014-11-22 21:39:13 calculatePvalues: calculating F-statistics for permutation 8 and seed 20140931
## 2014-11-22 21:39:14 findRegions: segmenting F-stats information
## 2014-11-22 21:39:14 findRegions: identifying candidate regions
## 2014-11-22 21:39:14 calculatePvalues: calculating F-statistics for permutation 9 and seed 20140932
## 2014-11-22 21:39:14 findRegions: segmenting F-stats information
## 2014-11-22 21:39:14 findRegions: identifying candidate regions
## 2014-11-22 21:39:14 calculatePvalues: calculating F-statistics for permutation 10 and seed 20140933
## 2014-11-22 21:39:15 findRegions: segmenting F-stats information
## 2014-11-22 21:39:15 findRegions: identifying candidate regions
## 2014-11-22 21:39:15 calculatePvalues: calculating F-statistics for permutation 11 and seed 20140934
## 2014-11-22 21:39:15 findRegions: segmenting F-stats information
## 2014-11-22 21:39:15 findRegions: identifying candidate regions
## 2014-11-22 21:39:15 calculatePvalues: calculating F-statistics for permutation 12 and seed 20140935
## 2014-11-22 21:39:16 findRegions: segmenting F-stats information
## 2014-11-22 21:39:16 findRegions: identifying candidate regions
## 2014-11-22 21:39:16 calculatePvalues: calculating F-statistics for permutation 13 and seed 20140936
## 2014-11-22 21:39:16 findRegions: segmenting F-stats information
## 2014-11-22 21:39:16 findRegions: identifying candidate regions
## 2014-11-22 21:39:16 calculatePvalues: calculating F-statistics for permutation 14 and seed 20140937
## 2014-11-22 21:39:16 findRegions: segmenting F-stats information
## 2014-11-22 21:39:16 findRegions: identifying candidate regions
## 2014-11-22 21:39:16 calculatePvalues: calculating F-statistics for permutation 15 and seed 20140938
## 2014-11-22 21:39:17 findRegions: segmenting F-stats information
## 2014-11-22 21:39:17 findRegions: identifying candidate regions
## 2014-11-22 21:39:17 calculatePvalues: calculating F-statistics for permutation 16 and seed 20140939
## 2014-11-22 21:39:17 findRegions: segmenting F-stats information
## 2014-11-22 21:39:17 findRegions: identifying candidate regions
## 2014-11-22 21:39:18 calculatePvalues: calculating F-statistics for permutation 17 and seed 20140940
## 2014-11-22 21:39:18 findRegions: segmenting F-stats information
## 2014-11-22 21:39:18 findRegions: identifying candidate regions
## 2014-11-22 21:39:18 calculatePvalues: calculating F-statistics for permutation 18 and seed 20140941
## 2014-11-22 21:39:18 findRegions: segmenting F-stats information
## 2014-11-22 21:39:19 findRegions: identifying candidate regions
## 2014-11-22 21:39:19 calculatePvalues: calculating F-statistics for permutation 19 and seed 20140942
## 2014-11-22 21:39:19 findRegions: segmenting F-stats information
## 2014-11-22 21:39:19 findRegions: identifying candidate regions
## 2014-11-22 21:39:19 calculatePvalues: calculating F-statistics for permutation 20 and seed 20140943
## 2014-11-22 21:39:19 findRegions: segmenting F-stats information
## 2014-11-22 21:39:19 findRegions: identifying candidate regions
## 2014-11-22 21:39:20 calculatePvalues: calculating the p-values
## 2014-11-22 21:39:22 analyzeChr: Annotating regions
## Matching regions to genes.
## nearestgene: loading bumphunter hg19 transcript database
## finding nearest transcripts...
## AnnotatingDone.
##    user  system elapsed 
##  41.607   0.440  42.397

To speed up analyzeChr(), you might need to use several cores via the mc.cores argument. If memory is limiting, you might want to use a smaller chunksize (default is 5 million). Note that if you use too many cores, you might hit the input/output ceiling of your data network and/or hard drives speed.

Before running with a large number of permutations we recommend exploring how long each permutation cycle takes using a single permutation.

Note that analyzing each chromosome with a large number of permutations and a rich data set can take several hours, so we recommend running one job running analyzeChr() per chromosome, and then merging the results via mergeResults(). This process is further described in the advanced derfinder vignette.

Explore results

When using returnOutput = TRUE, analyzeChr() will return a list with the results to explore interactively. However, by default it writes the results to disk (one .Rdata file per result).

The following code explores the results.

## Explore
names(results)
## [1] "timeinfo"     "optionsStats" "coveragePrep" "fstats"      
## [5] "regions"      "annotation"

optionStats

optionStats stores the main options used in the analyzeChr() call including the models used, the type of cutoff, number of permutations, seeds for the permutations. All this information can be useful to reproduce the analysis.

## Explore optionsStats
names(results$optionsStats)
##  [1] "models"          "cutoffPre"       "scalefac"       
##  [4] "chunksize"       "cutoffFstat"     "cutoffType"     
##  [7] "nPermute"        "seeds"           "groupInfo"      
## [10] "lowMemDir"       "analyzeCall"     "cutoffFstatUsed"
## [13] "returnOutput"
## Call used
results$optionsStats$analyzeCall
## analyzeChr(chr = "chr21", coverageInfo = filteredCov$chr21, models = models, 
##     cutoffFstat = 0.05, nPermute = 20, seeds = 20140923 + seq_len(20), 
##     groupInfo = pheno$group, writeOutput = TRUE, returnOutput = TRUE)

coveragePrep

coveragePrep has the result from the preprocessCoverage() step. This includes the genomic position index, the mean coverage (after scaling and the log2 transformation) for all the samples, and the group mean coverages. By default, the chunks are written to disk in optionsStats$lowMemDir (chr21/chunksDir in this example) to help reduce the required memory resources. Otherwise it is stored in coveragePrep$coverageProcessed.

## Explore coveragePrep
names(results$coveragePrep)
## [1] "coverageProcessed" "mclapplyIndex"     "position"         
## [4] "meanCoverage"      "groupMeans"
## Group means
results$coveragePrep$groupMeans
## $fetal
## numeric-Rle of length 130356 with 116452 runs
##   Lengths:                 1                 1 ...                 1
##   Values : 0.401666664828857 0.428333345800638 ...  1.24833332498868
## 
## $adult
## numeric-Rle of length 130356 with 119226 runs
##   Lengths:                 1                 1 ...                 1
##   Values : 0.406666670615474  0.41333334085842 ...   1.6266666551431

fstats

The F-statistics are then stored in fstats. These are calculated using calculateStats().

## Explore optionsStats
results$fstats
## numeric-Rle of length 130356 with 126807 runs
##   Lengths:                   1                   1 ...                   1
##   Values :  0.0192260952849003  0.0299693665742238 ...    2.32463844414503
## Note that the length matches the number of bases used
identical(length(results$fstats), sum(results$coveragePrep$position))
## [1] TRUE

regions

The candidate DERs and summary results from the permutations is then stored in regions. This is the output from calculatePvalues() which uses several underneath other functions including calculateStats() and findRegions().

## Explore regions
names(results$regions)
## [1] "regions"         "nullStats"       "nullWidths"      "nullPermutation"

For the null regions, the summary information is composed of the mean F-statistic for the null regions (regions$nullStats), the width of the null regions (regions$nullWidths), and the permutation number under which they were identified (regions$nullPermutation).

## Permutation summary information
results$regions[2:4]
## $nullStats
## numeric-Rle of length 16738 with 16736 runs
##   Lengths:                1                1 ...                1
##   Values : 12.2254943226418 5.42502250701367 ... 9.56804518038913
## 
## $nullWidths
## integer-Rle of length 16738 with 14779 runs
##   Lengths:   1   1   1   1   1   1   1   1 ...   1   1   1   1   1   1   1
##   Values :   8   2  49   1   2   1   2   1 ...   7   3  25   2  20   1 151
## 
## $nullPermutation
## integer-Rle of length 16738 with 20 runs
##   Lengths: 1128 3248  858 1472  406 1512 ...  912  296  266  196  330 1104
##   Values :    1    2    3    4    5    6 ...   15   16   17   18   19   20

The most important part of the output is the GRanges object with the candidate DERs shown below.

## Candidate DERs
results$regions$regions
## GRanges object with 591 ranges and 14 metadata columns:
##       seqnames               ranges strand   |     value      area
##                           |  
##    up    chr21 [47610386, 47610682]      *   | 11.103042  3297.603
##    up    chr21 [40196145, 40196444]      *   | 10.061425  3018.427
##    up    chr21 [27253616, 27253948]      *   |  8.434883  2808.816
##    up    chr21 [22115534, 22115894]      *   |  7.236451  2612.359
##    up    chr21 [22914853, 22915064]      *   |  9.780659  2073.500
##   ...      ...                  ...    ... ...       ...       ...
##    up    chr21 [35889784, 35889784]      *   |  5.319523  5.319523
##    up    chr21 [47610093, 47610093]      *   |  5.319119  5.319119
##    up    chr21 [16333728, 16333728]      *   |  5.318807  5.318807
##    up    chr21 [34001896, 34001896]      *   |  5.318713  5.318713
##    up    chr21 [34809571, 34809571]      *   |  5.318014  5.318014
##       indexStart  indexEnd cluster clusterL meanCoverage meanfetal
##                     
##    up     122158    122454     138      933    1.5979517  0.822890
##    up      76110     76409      71     1323    1.3035083  2.025322
##    up      22019     22351      28      407   33.6578579 42.467042
##    up      12274     12634       9      694    0.9644645  1.719058
##    up      17318     17529      21      217    2.8389780  4.235928
##   ...        ...       ...     ...      ...          ...       ...
##    up      60088     60088      51      742     2.754167  3.360000
##    up     121865    121865     138      933     1.455833  0.775000
##    up       5048      5048       1        9     1.195000  1.231667
##    up      32577     32577      38     1428     1.712500  2.333333
##    up      43694     43694      46      149     2.950000  2.878333
##        meanadult log2FoldChangeadultvsfetal    pvalues significant
##                                
##    up  2.3730135                  1.5279488 0.00985722        TRUE
##    up  0.5816944                 -1.7998180 0.01129100        TRUE
##    up 24.8486737                 -0.7731748 0.01224685        TRUE
##    up  0.2098707                 -3.0340455 0.01296374        TRUE
##    up  1.4420283                 -1.5545785 0.01834040        TRUE
##   ...        ...                        ...        ...         ...
##    up   2.148333                -0.64524334  0.9979688       FALSE
##    up   2.136667                 1.46309367  0.9983273       FALSE
##    up   1.158333                -0.08856135  0.9988052       FALSE
##    up   1.091667                -1.09586004  0.9991636       FALSE
##    up   3.021667                 0.07011082  0.9996416       FALSE
##         qvalues significantQval
##               
##    up 0.9484048           FALSE
##    up 0.9484048           FALSE
##    up 0.9484048           FALSE
##    up 0.9484048           FALSE
##    up 0.9484048           FALSE
##   ...       ...             ...
##    up 0.9996416           FALSE
##    up 0.9996416           FALSE
##    up 0.9996416           FALSE
##    up 0.9996416           FALSE
##    up 0.9996416           FALSE
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The metadata columns are:

Note that for this type of analysis you might want to try a few coverage cutoffs and/or F-statistic cutoffs. One quick way to evaluate the results is to compare the width of the regions.

## Width of potential DERs
summary(width(results$regions$regions))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    4.00   17.98   17.00  361.00
sum(width(results$regions$regions) > 50)
## [1] 68
## Width of candidate DERs
sig <- as.logical(results$regions$regions$significant)
summary(width(results$regions$regions[ sig ]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    76.0    97.5   115.0   163.4   193.5   361.0
sum(width(results$regions$regions[ sig ]) > 50)
## [1] 19

Nearest annotation

analyzeChr() will find the nearest annotation feature using annotateNearest() from bumphunter. This information is useful considering that the candidate DERs were identified without relying on annotation. Yet at the end, we are interested to check if they are inside a known exon, upstream a gene, etc.

## Nearest annotation
head(results$annotation)
##        name
## 1       LSS
## 2      ETS2
## 3       APP
## 4 LINC00320
## 5     NCAM2
## 6 LINC00320
##                                                                                                                                                                                                                                          annotation
## 1                                                                                             NM_001001438 NM_001145436 NM_001145437 NM_002340 NP_001001438 NP_001138908 NP_001138909 NP_002331 XM_006724003 XM_006724004 XP_006724066 XP_006724067
## 2                                                                                                                                                                           NM_001256295 NM_005239 NP_001243224 NP_005230 XM_005260935 XP_005260992
## 3 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_958816 NP_958817
## 4                                                                                                                                                                                                           NR_024090 NR_109786 NR_109787 NR_109788
## 5                                                                                                                       NM_004540 NP_004531 XM_005260988 XM_006724007 XM_006724008 XM_006724009 XP_005261045 XP_006724070 XP_006724071 XP_006724072
## 6                                                                                                                                                                                                           NR_024090 NR_109786 NR_109787 NR_109788
##   description     region distance   subregion insidedistance exonnumber
## 1 inside exon     inside    38056 inside exon              0         22
## 2 inside exon     inside    18390 inside exon              0         10
## 3 inside exon     inside   289498 inside exon              0         16
## 4                       0                     NA         NA
## 5  downstream downstream   544220                     NA         NA
## 6                       0                     NA         NA
##   nexons   UTR strand  geneL codingL
## 1     22 3'UTR      -  39700   37641
## 2     10 3'UTR      +  19123   12854
## 3     16 3'UTR      - 290585  230434
## 4      7        -      0       0
## 5     18        + 541884  539396
## 6      7        -      0       0

For more details on the output please check the bumphunter package.

Check the section about non-human data (specifically, using annotation different from hg19) on the advanced vignette.

Time spent

The final piece is the wallclock time spent during each of the steps in analyzeChr().

## Time spent
results$timeinfo
##                      init                     setup 
## "2014-11-22 21:39:03 PST" "2014-11-22 21:39:03 PST" 
##                  prepData                  savePrep 
## "2014-11-22 21:39:09 PST" "2014-11-22 21:39:09 PST" 
##            calculateStats                 saveStats 
## "2014-11-22 21:39:10 PST" "2014-11-22 21:39:10 PST" 
##             saveStatsOpts          calculatePvalues 
## "2014-11-22 21:39:10 PST" "2014-11-22 21:39:22 PST" 
##                  saveRegs                  annotate 
## "2014-11-22 21:39:22 PST" "2014-11-22 21:39:46 PST" 
##                  saveAnno 
## "2014-11-22 21:39:46 PST"
## Use this information to make a plot
timed <- diff(results$timeinfo)
timed.df <- data.frame(Seconds = as.numeric(timed), Step = factor(names(timed),
    levels = rev(names(timed))))
library('ggplot2')
ggplot(timed.df, aes(y = Step, x = Seconds)) + geom_point()
plot of chunk exploreTime

Merge results

Once you have analyzed each chromosome using analyzeChr(), you can use mergeResults() to merge the results. This function does not return an object in R but instead creates several Rdata files with the main results from the different chromosomes.

## Go back to the original directory
setwd(originalWd)

## Merge results from several chromosomes. In this case we only have one.
mergeResults(chrs='chr21', prefix="analysisResults",
    genomicState = genomicState$fullGenome, 
    optionsStats = results$optionsStats)
## 2014-11-22 21:39:47 mergeResults: Saving options used
## 2014-11-22 21:39:47 Loading chromosome chr21
## 2014-11-22 21:39:47 mergeResults: calculating FWER
## 2014-11-22 21:39:47 mergeResults: Saving fullNullSummary
## 2014-11-22 21:39:47 mergeResults: Re-calculating the p-values
## 2014-11-22 21:39:47 mergeResults: Saving fullRegions
## 2014-11-22 21:39:47 mergeResults: assigning genomic states
## 2014-11-22 21:39:47 annotateRegions: counting
## 2014-11-22 21:39:47 annotateRegions: annotating
## 2014-11-22 21:39:48 mergeResults: Saving fullAnnotatedRegions
## 2014-11-22 21:39:48 mergeResults: Saving fullFstats
## 2014-11-22 21:39:48 mergeResults: Saving fullTime
## Files created by mergeResults()
dir('analysisResults', pattern = '.Rdata')
## [1] "fullAnnotatedRegions.Rdata" "fullFstats.Rdata"          
## [3] "fullNullSummary.Rdata"      "fullRegions.Rdata"         
## [5] "fullTime.Rdata"             "optionsMerge.Rdata"

optionsMerge

For reproducibility purposes, the options used the merge the results are stored in optionsMerge.

## Options used to merge
load(file.path('analysisResults', 'optionsMerge.Rdata'))

## Contents
names(optionsMerge)
## [1] "chrs"            "significantCut"  "minoverlap"      "mergeCall"      
## [5] "cutoffFstatUsed" "optionsStats"
## Merge call
optionsMerge$mergeCall
## mergeResults(chrs = "chr21", prefix = "analysisResults", genomicState = genomicState$fullGenome, 
##     optionsStats = results$optionsStats)

fullRegions

The main result from mergeResults() is in fullRegions. This is a GRanges object with the candidate DERs from all the chromosomes. It also includes the nearest annotation metadata as well as FWER adjusted p-values (fwer) and whether the FWER adjusted p-value is less than 0.05 (significantFWER).

## Load all the regions
load(file.path('analysisResults', 'fullRegions.Rdata'))

## Metadata columns
names(mcols(fullRegions))
##  [1] "value"                      "area"                      
##  [3] "indexStart"                 "indexEnd"                  
##  [5] "cluster"                    "clusterL"                  
##  [7] "meanCoverage"               "meanfetal"                 
##  [9] "meanadult"                  "log2FoldChangeadultvsfetal"
## [11] "pvalues"                    "significant"               
## [13] "qvalues"                    "significantQval"           
## [15] "name"                       "annotation"                
## [17] "description"                "region"                    
## [19] "distance"                   "subregion"                 
## [21] "insidedistance"             "exonnumber"                
## [23] "nexons"                     "UTR"                       
## [25] "annoStrand"                 "geneL"                     
## [27] "codingL"                    "fwer"                      
## [29] "significantFWER"

Note that analyzeChr() only has the information for a given chromosome at a time, so mergeResults() re-calculates the p-values and q-values using the information from all the chromosomes.

fullAnnotatedRegions

In preparation for visually exploring the results, mergeResults() will run annotateRegions() which counts how many known exons, introns and intragenic segments each candidate DER overlaps (by default with a minimum overlap of 20bp). annotateRegions() uses a summarized version of the genome annotation created with makeGenomicState(). For this example, we can use the data included in derfinder which is the summarized annotation of hg19 for chromosome 21.

## Load annotateRegions() output
load(file.path('analysisResults', 'fullAnnotatedRegions.Rdata'))

## Information stored
names(fullAnnotatedRegions)
## [1] "countTable"     "annotationList"
## Take a peak
lapply(fullAnnotatedRegions, head)
## $countTable
##   exon intragenic intron
## 1    1          0      0
## 2    1          0      0
## 3    1          0      0
## 4    1          0      0
## 5    0          1      0
## 6    1          0      0
## 
## $annotationList
## GRangesList object of length 6:
## $1 
## GRanges object with 1 range and 4 metadata columns:
##        seqnames               ranges strand |   theRegion
##                          | 
##   4871    chr21 [47609038, 47611149]      - |        exon
##                        tx_id                              tx_name
##                                      
##   4871 73448,73449,73450,... uc002zij.3,uc002zik.2,uc002zil.2,...
##                 gene
##        
##   4871           170
## 
## $2 
## GRanges object with 1 range and 4 metadata columns:
##        seqnames               ranges strand | theRegion       tx_id
##   1189    chr21 [40194598, 40196878]      + |      exon 72757,72758
##                      tx_name gene
##   1189 uc002yxf.3,uc002yxg.4   77
## 
## $3 
## GRanges object with 1 range and 4 metadata columns:
##        seqnames               ranges strand | theRegion
##   2965    chr21 [27252861, 27254082]      - |      exon
##                        tx_id                              tx_name gene
##   2965 73066,73067,73068,... uc002ylz.3,uc002yma.3,uc002ymb.3,...  139
## 
## ...
## <3 more elements>
## -------
## seqinfo: 1 sequence from hg19 genome

Visually explore results

Optionally, we can use the addon package derfinderPlot to visually explore the results. For more details, please check its vignette.

To make the region level plots, we will need to extract the region level coverage data. We can do so using getRegionCoverage() as shown below.

## Find overlaps between regions and summarized genomic annotation
annoRegs <- annotateRegions(fullRegions, genomicState$fullGenome)
## 2014-11-22 21:39:48 annotateRegions: counting
## 2014-11-22 21:39:48 annotateRegions: annotating
## Indeed, the result is the same because we only used chr21
identical(annoRegs, fullAnnotatedRegions)
## [1] FALSE
## Get the region coverage
regionCov <- getRegionCoverage(fullCov, fullRegions)
## 2014-11-22 21:39:49 getRegionCoverage: processing chr21
## 2014-11-22 21:39:49 getRegionCoverage: done processing chr21
## Explore the result
head(regionCov[[1]])
##   HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178
## 1   0.68   0.44   0.48   0.36   0.19   2.34   1.29   1.77   2.21   2.69
## 2   0.60   0.44   0.48   0.36   0.19   2.30   1.29   1.77   2.21   2.64
## 3   0.60   0.40   0.48   0.32   0.19   2.39   1.37   1.81   2.31   2.69
## 4   0.64   0.40   0.48   0.32   0.19   2.61   1.42   1.89   2.36   2.88
## 5   0.64   0.40   0.48   0.36   0.19   2.65   1.42   1.93   2.36   2.88
## 6   0.60   0.44   0.48   0.39   0.23   2.65   1.59   1.93   2.36   2.83
##   HSB92 HSB97
## 1  1.89  3.57
## 2  1.86  3.60
## 3  1.89  3.60
## 4  1.89  3.60
## 5  1.96  3.70
## 6  1.93  3.70

With this, we are all set to visually explore the results.

library('derfinderPlot')

## Overview of the candidate DERs in the genome
plotOverview(regions = fullRegions, annotation = results$annotation, type = 'fwer')

suppressPackageStartupMessages(library('TxDb.Hsapiens.UCSC.hg19.knownGene'))
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## Base-levle coverage plots for the first 10 regions
plotRegionCoverage(regions = fullRegions, regionCoverage = regionCov, 
    groupInfo = pheno$group, nearestAnnotation = results$annotation, 
    annotatedRegions = annoRegs, whichRegions=1:10, txdb = txdb, scalefac = 1, 
    ask = FALSE)

## Cluster plot for the first region
plotCluster(idx = 1, regions = fullRegions, annotation = results$annotation, coverageInfo = fullCov$chr21, txdb = txdb, groupInfo = pheno$group, titleUse = 'fwer')

Results HTML report

We have also developed an addon package called regionReport available via GitHub. For more information check its vignette.

The function derfinderReport() in regionReport basically takes advantage of the results from mergeResults() and plotting functions available in derfinderPlot as well as other neat features from rCharts and knitrBoostrap (Hester, 2013).

The resulting HTML report promotes reproducibility of the analysis and allows you to explore in more detail the results through some diagnostic plots.

Region matrix analysis

An alternative type of analysis is driven by regionMatrix(). The idea is to consider consecutive bases with mean coverage above a given cutoff as potentially differentially expressed regions. Then, get a matrix where each row is one of these regions and each column represents a sample. Each cell of the matrix has the mean coverage for the specific region - sample pair. Then, other packages specialized in differential expression at the counts level can be used, for example limma.

In this example, we will use regionMatrix() where we filter by mean coverage greater than 30 out of counts standardized to libraries of 40 million reads. Note that read of the BrainSpan data are 76bp long.

## Use regionMatrix()
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, totalMapped = 2^(sampleDepths), targetSize = 40e6))
## 2014-11-22 21:39:49 regionMatrix: processing chr21
## 2014-11-22 21:39:49 filterData: normalizing coverage
## 2014-11-22 21:39:50 filterData: done normalizing coverage
## 2014-11-22 21:39:52 filterData: originally there were 48129895 rows, now there are 189651 rows. Meaning that 99.61 percent was filtered.
## 2014-11-22 21:39:52 findRegions: identifying potential segments
## 2014-11-22 21:39:52 findRegions: segmenting F-stats information
## 2014-11-22 21:39:52 findRegions: identifying candidate regions
## 2014-11-22 21:39:52 findRegions: identifying region clusters
## 2014-11-22 21:39:55 getRegionCoverage: processing chr21
## 2014-11-22 21:39:56 getRegionCoverage: done processing chr21
##    user  system elapsed 
##   7.373   0.044   7.424
## Explore results
names(regionMat$chr21)
## [1] "regions"        "coverageMatrix" "bpCoverage"

regionMatrix() returns three pieces of output.

Similar to what we saw earlier, the regions are arranged in a GRanges object. In this case, the metadata is simpler because no annotation information is included.

## regions output
regionMat$chr21$regions
## GRanges object with 1822 ranges and 6 metadata columns:
##        seqnames               ranges strand   |            value
##                            |        
##      1    chr21   [9825461, 9825583]      *   | 64.1835795934005
##      2    chr21   [9825645, 9825646]      *   | 30.6206716563444
##      3    chr21   [9825872, 9826023]      *   | 41.4259978624777
##      4    chr21   [9826149, 9826225]      *   | 34.5680066935166
##      5    chr21   [9826990, 9827584]      *   | 12578.8649688978
##    ...      ...                  ...    ... ...              ...
##   1818    chr21 [48084207, 48084835]      *   | 227.754828611473
##   1819    chr21 [48085444, 48085455]      *   | 31.5989673322147
##   1820    chr21 [48085458, 48085458]      *   |  30.078238709839
##   1821    chr21 [48085469, 48085470]      *   | 30.3090276995855
##   1822    chr21 [48085475, 48085477]      *   | 30.5711564090198
##                    area indexStart  indexEnd cluster clusterL
##                         
##      1 7894.58028998826          1       123       1      765
##      2 61.2413433126887        124       125       1      765
##      3 6296.75167509661        126       277       1      765
##      4 2661.73651540077        278       354       1      765
##      5 7484424.65649418        355       949       2      595
##    ...              ...        ...       ...     ...      ...
##   1818 143257.787196617     189005    189633     800      629
##   1819 379.187607986577     189634    189645     801       34
##   1820  30.078238709839     189646    189646     801       34
##   1821  60.618055399171     189647    189648     801       34
##   1822 91.7134692270595     189649    189651     801       34
##   -------
##   seqinfo: 1 sequence from an unspecified genome
## Number of regions
length(regionMat$chr21$regions)
## [1] 1822

bpCoverage is the base level coverage list which can then be used for plotting.

## Base-level coverage matrices for each of the regions
## Useful for plotting
lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## $`1`
##     HSB113   HSB123   HSB126   HSB130   HSB135   HSB136 HSB145   HSB153
## 1 117.8935 4.033258 3.169279 3.723920 8.171553 6.974008      0 69.91491
## 2 117.8935 4.033258 3.169279 5.851874 8.171553 6.974008      0 69.91491
##     HSB159   HSB178    HSB92    HSB97
## 1 22.58134 7.859689 6.732025 110.7007
## 2 22.58134 7.859689 6.732025 110.7007
## 
## $`2`
##       HSB113   HSB123   HSB126   HSB130   HSB135   HSB136   HSB145  HSB153
## 124 130.4170 8.642696 12.67711 13.29971 9.079504 17.04757 9.182136 36.1036
## 125 128.6896 8.642696 12.67711 13.29971 9.079504 17.04757 9.182136 36.1036
##       HSB159   HSB178   HSB92    HSB97
## 124 38.71087 10.66672 13.9819 67.40687
## 125 38.71087 10.66672 13.9819 69.59896
## Check dimensions. First region is 123 long, second one is 2 bp long.
## The columns match the number of samples (12 in this case).
lapply(regionMat$chr21$bpCoverage[1:2], dim)
## $`1`
## [1] 123  12
## 
## $`2`
## [1]  2 12

The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.

## Dimensions of the coverage matrix
dim(regionMat$chr21$coverageMatrix)
## [1] 1822   12
## Coverage for each region. This matrix can then be used with limma or other pkgs
head(regionMat$chr21$coverageMatrix)
##         HSB113       HSB123       HSB126       HSB130       HSB135
## 1 4.464305e+02 1.489728e+01 1.647191e+01 3.476125e+01 3.460167e+01
## 2 3.409298e+00 2.274394e-01 3.336083e-01 3.499925e-01 2.389343e-01
## 3 1.674249e+02 1.058351e+01 1.822335e+01 4.820096e+01 1.784441e+01
## 4 1.524354e+02 7.535825e+00 7.255980e+00 9.890788e+00 8.454292e+00
## 5 1.578244e+05 1.596985e+04 7.380968e+04 5.888896e+04 2.721376e+05
## 6 1.450049e+04 5.774746e+03 2.457445e+04 1.429600e+04 3.642908e+04
##         HSB136       HSB145       HSB153       HSB159       HSB178
## 1 4.344990e+01 4.733207e+00 2.160410e+02    83.678304 1.806103e+01
## 2 4.486204e-01 2.416352e-01 9.500947e-01     1.018707 2.807032e-01
## 3 3.629747e+01 1.333258e+01 4.147993e+01    18.973422 0.000000e+00
## 4 3.898409e+01 1.063905e+01 8.007187e+01     5.560443 0.000000e+00
## 5 2.159025e+05 7.185880e+04 8.379324e+04 19258.506223 7.904146e+04
## 6 4.520007e+04 1.394158e+04 1.806119e+04  4400.059413 1.092042e+04
##          HSB92        HSB97
## 1 4.329483e+01   290.091832
## 2 3.679447e-01     1.802708
## 3 1.210811e+01   609.755286
## 4 7.392963e+00    92.053499
## 5 6.054930e+04 72717.000773
## 6 6.652904e+03 11113.935080

Feature level analysis

Similar to the region level analysis, you might be interested in performing a feature level analysis. More specifically, this could be getting a count matrix at the exon level. coverageToExon() allows you to get such a matrix by taking advantage of the summarized annotation produced by makeGenomicState().

In this example, we use the genomic state included in the package which has the information for chr21 TxDb.Hsapiens.UCSC.hg19.knownGene annotation.

## Get the exon level matrix
system.time(exonCov <- coverageToExon(fullCov, genomicState$fullGenome, L = 76))
## class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE
## class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE
## 2014-11-22 21:40:02 coverageToExon: processing chromosome chr21
## class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE
## 2014-11-22 21:40:07 coverageToExon: processing chromosome chr21
##    user  system elapsed 
##   9.032   1.652  10.704
## Dimensions of the matrix
dim(exonCov)
## [1] 2658   12
## Explore a little bit
tail(exonCov)
##         HSB113      HSB123      HSB126      HSB130       HSB135
## 4983 0.1173684   0.1382895   0.0000000   0.0000000   0.07894737
## 4985 4.6542105   1.4557895   1.6034210   0.7798684   1.29592103
## 4986 7.1510526 291.0205261 252.4892106 152.1905258 439.69500020
## 4988 0.0000000   0.7063158   0.7223684   0.2639474   0.48657895
## 4990 2.3064474  64.9423686  69.6584212  35.5769737 101.76394746
## 4992 0.1652632   8.1834211   9.7538158   2.4193421  10.08618420
##            HSB136       HSB145     HSB153      HSB159    HSB178
## 4983   0.03947368 5.263158e-03  0.1052632  0.04934211 0.1289474
## 4985   0.72986842 9.964474e-01 10.4906578  4.39013167 7.3119736
## 4986 263.63486853 2.345414e+02  4.5310526 10.64368415 4.9807895
## 4988   0.37657895 6.156579e-01  0.0000000  0.00000000 0.0000000
## 4990  76.96736838 5.978842e+01  1.2284210  2.67486840 0.9542105
## 4992  14.41013157 5.325658e+00  0.2402632  0.73039474 0.2601316
##            HSB92       HSB97
## 4983  0.03986842  0.44605264
## 4985  1.65184211 10.87723678
## 4986 11.40447366  6.23315790
## 4988  0.05644737  0.00000000
## 4990  3.98013159  1.96131579
## 4992  0.66907895  0.07473684

With this matrix, rounded if necessary, you can proceed to use packages such as limma, DESeq, edgeR, among others.

Compare results visually

We can certainly make region level plots using plotRegionCoverage() or cluster plots using plotCluster() or overview plots using plotOveview(), all from derfinderPlot.

First we need to get the relevant annotation information.

## Annotate regions as exonic, intronic or intragenic
system.time(annoGenome <- annotateRegions(regionMat$chr21$regions, genomicState$fullGenome))
## 2014-11-22 21:40:09 annotateRegions: counting
## 2014-11-22 21:40:09 annotateRegions: annotating
##    user  system elapsed 
##   0.452   0.000   0.454
## Note that the genomicState object included in derfinder only has information
## for chr21 (hg19).

## Identify closest genes to regions
suppressPackageStartupMessages(library('bumphunter'))
system.time(annoNear <- annotateNearest(regionMat$chr21$regions, subject = 'hg19'))
## Matching regions to genes.
## nearestgene: loading bumphunter hg19 transcript database
## finding nearest transcripts...
## Annotating.Done.
##    user  system elapsed 
##  71.209   0.016  71.408

Now we can proceed to use derfinderPlot to make the region level plots for the top 100 regions.

## Identify the top regions by highest total coverage
top <- order(regionMat$chr21$regions$area, decreasing = TRUE)[1:100]

## Base-level plots for the top 100 regions with transcript information
suppressPackageStartupMessages(library('TxDb.Hsapiens.UCSC.hg19.knownGene'))
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

library('derfinderPlot')
plotRegionCoverage(regionMat$chr21$regions, regionCoverage = regionMat$chr21$bpCoverage, groupInfo = pheno$group, nearestAnnotation = annoNear, annotatedRegions = annoGenome, whichRegions = top, scalefac = 1, txdb = txdb, ask = FALSE)

However, we can alternatively use epivizr to view the candidate DERs and the region matrix results in a genome browser.

## Load epivizr, it's available from Bioconductor
library('epivizr')

## Load data to your browser
mgr <- startEpiviz()
ders_dev <- mgr$addDevice(
    fullRegions[as.logical(fullRegions$significantFWER) ], "Candidate DERs")
ders_potential_dev <- mgr$addDevice(
    fullRegions[!as.logical(fullRegions$significantFWER) ], "Potential DERs")
regs_dev <- mgr$addDevice(regionMat$chr21$regions, "Region Matrix")

## Go to a place you like in the genome
mgr$navigate("chr21", start(regionMat$chr21$regions[top[1]]) - 100, end(regionMat$chr21$regions[top[1]]) + 100)

## Stop the navigation
mgr$stopServer()

Export coverage to BigWig files

derfinder also includes createBw() with related functions createBwSample() and coerceGR() to export the output of fullCoverage() to BigWig files. These functions can be useful in the case where you start with BAM files and later on want to save the coverage data into BigWig files, which are generally smaller.

## Subset only the first sample
fullCovSmall <- lapply(fullCov, '[', 1)

## Export to BigWig
bw <- createBw(fullCovSmall)
## 2014-11-22 21:41:21 coerceGR: coercing sample HSB113
## 2014-11-22 21:41:21 createBwSample: exporting bw for sample HSB113
## See the file. Note that the sample name is used to name the file.
dir(pattern = '.bw')
## [1] "HSB113.bw"
## Internally createBw() coerces each sample to a GRanges object before 
## exporting to a BigWig file. If more than one sample was exported, the
## GRangesList would have more elements.
bw
## GRangesList object of length 1:
## $HSB113 
## GRanges object with 155950 ranges and 1 metadata column:
##         seqnames               ranges strand   |              score
##                             |          
##   chr21    chr21   [9458667, 9458741]      *   | 0.0399999991059303
##   chr21    chr21   [9540957, 9540971]      *   | 0.0399999991059303
##   chr21    chr21   [9543719, 9543778]      *   | 0.0399999991059303
##   chr21    chr21   [9651480, 9651554]      *   | 0.0399999991059303
##   chr21    chr21   [9653397, 9653471]      *   | 0.0399999991059303
##     ...      ...                  ...    ... ...                ...
##   chr21    chr21 [48093246, 48093255]      *   | 0.0399999991059303
##   chr21    chr21 [48093257, 48093331]      *   | 0.0399999991059303
##   chr21    chr21 [48093350, 48093424]      *   | 0.0399999991059303
##   chr21    chr21 [48112194, 48112268]      *   | 0.0399999991059303
##   chr21    chr21 [48115056, 48115130]      *   | 0.0399999991059303
## 
## -------
## seqinfo: 1 sequence from an unspecified genome

Advanced arguments

If you are interested in using the advanced arguments, use advancedArg() as shown below:

## URLs to advanced arguemtns
sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE)
##                                                                                                                             analyzeChr 
## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aanalyze+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code" 
##                                                                                                                           loadCoverage 
##    "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aload+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code"
## Set browse = TRUE if you want to open them in your browser

The most common advanced arguments are chrsStyle (default is UCSC) and verbose (by default TRUE). verbose controls whether to print status updates for nearly all the functions. chrsStyle is used to determine the chromosome naming style and is powered by GenomeInfoDb. Note that chrsStyle is used in any of the functions that call extendedMapSeqlevels(). If you are working with a different organism than Homo sapiens set the global species option using options(species = 'your species') with the syntax used in names(GenomeInfoDb::genomeStyles()). For example:

## Set global species and chrsStyle options
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')

The third commonly used advanced argument is mc.cores. It controls the number of cores to use for the functions that can run with more than one core to speed up. In nearly all the cases, the maximum number of cores depends on the number of chromosomes. One notable exception is analyzeChr() where the maximum number of cores depends on the chunksize used and the dimensions of the data for the chromosome under study.

Summary

We have illustrated how to identify candidate differentially expressed regions without using annotation in the identification process by using analyzeChr(). Furthermore, we covered how to perform the region matrix analysis with regionMatrix(). We also highlighted other uses of the derfinder package.

Origins

This implementation of derfinder has its origins in Alyssa C. Frazee's derfinder (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014). The statistical methods and implementation by now are very different.

Citing derfinder

Please use:

## Citation info
citation('derfinder')
## 
## Collado-Torres L, Frazee AC, Jaffe AE and Leek JT (2014).
## _derfinder: Annotation-agnostic differential expression analysis
## of RNA-seq data at base-pair resolution_.
## https://github.com/lcolladotor/derfinder - R package version
## 1.0.10, .
## 
## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA and Leek JT
## (2014). "Differential expression analysis of RNA-seq data at
## single-base resolution." _Biostatistics_, *15 (3)*, pp. 413-426.
## , .

Acknowledgements

We would like to thank Jessica Hekman for testing derfinder with non-human data, thus discovering some small bugs or sections of the documentation that were not clear.

Reproducibility

This package was made possible thanks to:

Code for creating the vignette

## Create the vignette
library('knitrBootstrap') 

knitrBootstrapFlag <- packageVersion('knitrBootstrap') < '1.0.0'
if(knitrBootstrapFlag) {
    ## CRAN version
    library('knitrBootstrap')
    system.time(knit_bootstrap('derfinder.Rmd', chooser=c('boot', 'code'), show_code = TRUE))
    unlink('derfinder.md')
} else {
    ## GitHub version
    library('rmarkdown')
    system.time(render('derfinder.Rmd', 'knitrBootstrap::bootstrap_document'))
}
## Note: if you prefer the knitr version use:
# library('rmarkdown')
# system.time(render('derfinder.Rmd', 'html_document'))
## Clean up
unlink('analysisResults', recursive = TRUE)
file.remove('HSB113.bw')
file.remove('derfinderRef.bib')

## Extract the R code
library('knitr')
knit('derfinder.Rmd', tangle = TRUE)

Date the vignette was generated.

## [1] "2014-11-22 21:41:22 PST"

Wallclock time spent generating the vignette.

## Time difference of 2.783 mins

R session information.

## Session info---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.1.2 (2014-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:                      
##  collate  C                           
##  tz       
## Packages-------------------------------------------------------------------------------------------
##  package           * version  date       source        
##  AnnotationDbi       1.28.1   2014-11-23 Bioconductor  
##  BBmisc              1.8      2014-10-30 CRAN (R 3.1.2)
##  BatchJobs           1.5      2014-10-30 CRAN (R 3.1.2)
##  Biobase             2.26.0   2014-11-23 Bioconductor  
##  BiocGenerics      * 0.12.1   2014-11-23 Bioconductor  
##  BiocParallel        1.0.0    2014-11-23 Bioconductor  
##  Biostrings          2.34.0   2014-11-23 Bioconductor  
##  DBI                 0.3.1    2014-09-24 CRAN (R 3.1.2)
##  Formula             1.1.2    2014-07-13 CRAN (R 3.1.2)
##  GenomeInfoDb      * 1.2.3    2014-11-23 Bioconductor  
##  GenomicAlignments   1.2.1    2014-11-23 Bioconductor  
##  GenomicFeatures     1.18.2   2014-11-23 Bioconductor  
##  GenomicFiles        1.2.0    2014-11-23 Bioconductor  
##  GenomicRanges     * 1.18.3   2014-11-23 Bioconductor  
##  Hmisc               3.14.6   2014-11-22 CRAN (R 3.1.2)
##  IRanges           * 2.0.0    2014-11-23 Bioconductor  
##  MASS                7.3.35   2014-09-30 CRAN (R 3.1.2)
##  Matrix              1.1.4    2014-06-15 CRAN (R 3.1.2)
##  R.methodsS3         1.6.1    2014-01-05 CRAN (R 3.1.2)
##  RColorBrewer        1.0.5    2011-06-17 CRAN (R 3.1.2)
##  RCurl               1.95.4.3 2014-07-29 CRAN (R 3.1.2)
##  RJSONIO             1.3.0    2014-07-28 CRAN (R 3.1.2)
##  RSQLite             1.0.0    2014-10-25 CRAN (R 3.1.2)
##  Rcpp                0.11.3   2014-09-29 CRAN (R 3.1.2)
##  RefManageR          0.8.40   2014-10-29 CRAN (R 3.1.2)
##  Rsamtools           1.18.2   2014-11-23 Bioconductor  
##  S4Vectors         * 0.4.0    2014-11-23 Bioconductor  
##  XML                 3.98.1.1 2013-06-20 CRAN (R 3.1.2)
##  XVector             0.6.0    2014-11-23 Bioconductor  
##  acepack             1.3.3.3  2013-05-03 CRAN (R 3.1.2)
##  base64enc           0.1.2    2014-06-26 CRAN (R 3.1.2)
##  bibtex              0.3.6    2013-07-29 CRAN (R 3.1.2)
##  biomaRt             2.22.0   2014-11-23 Bioconductor  
##  bitops              1.0.6    2013-08-17 CRAN (R 3.1.2)
##  brew                1.0.6    2011-04-13 CRAN (R 3.1.2)
##  bumphunter        * 1.6.0    2014-11-23 Bioconductor  
##  checkmate           1.5.0    2014-10-19 CRAN (R 3.1.2)
##  cluster             1.15.3   2014-09-04 CRAN (R 3.1.2)
##  codetools           0.2.9    2014-08-21 CRAN (R 3.1.2)
##  colorspace          1.2.4    2013-09-30 CRAN (R 3.1.2)
##  derfinder         * 1.0.10   2014-11-23 Bioconductor  
##  derfinderData     * 0.99.2   2014-11-01 Bioconductor  
##  derfinderHelper     1.0.4    2014-11-23 Bioconductor  
##  devtools          * 1.6.1    2014-10-07 CRAN (R 3.1.2)
##  digest              0.6.4    2013-12-03 CRAN (R 3.1.2)
##  doRNG               1.6      2014-03-07 CRAN (R 3.1.2)
##  evaluate            0.5.5    2014-04-29 CRAN (R 3.1.2)
##  fail                1.2      2013-09-19 CRAN (R 3.1.2)
##  foreach           * 1.4.2    2014-04-11 CRAN (R 3.1.2)
##  foreign             0.8.61   2014-03-28 CRAN (R 3.1.2)
##  formatR             1.0      2014-08-25 CRAN (R 3.1.2)
##  ggplot2           * 1.0.0    2014-05-21 CRAN (R 3.1.2)
##  gtable              0.1.2    2012-12-05 CRAN (R 3.1.2)
##  highr               0.4      2014-10-23 CRAN (R 3.1.2)
##  httr                0.5      2014-09-02 CRAN (R 3.1.2)
##  iterators         * 1.0.7    2014-04-11 CRAN (R 3.1.2)
##  knitcitations     * 1.0.4    2014-10-28 CRAN (R 3.1.2)
##  knitr             * 1.8      2014-11-11 CRAN (R 3.1.2)
##  knitrBootstrap    * 0.9.0    2013-10-17 CRAN (R 3.1.2)
##  labeling            0.3      2014-08-23 CRAN (R 3.1.2)
##  lattice             0.20.29  2014-04-04 CRAN (R 3.1.2)
##  latticeExtra        0.6.26   2013-08-15 CRAN (R 3.1.2)
##  locfit            * 1.5.9.1  2013-04-20 CRAN (R 3.1.2)
##  lubridate           1.3.3    2013-12-31 CRAN (R 3.1.2)
##  markdown            0.7.4    2014-08-24 CRAN (R 3.1.2)
##  matrixStats         0.10.3   2014-10-15 CRAN (R 3.1.2)
##  memoise             0.2.1    2014-04-22 CRAN (R 3.1.2)
##  munsell             0.4.2    2013-07-11 CRAN (R 3.1.2)
##  nnet                7.3.8    2014-03-28 CRAN (R 3.1.2)
##  pkgmaker            0.22     2014-05-14 CRAN (R 3.1.2)
##  plyr                1.8.1    2014-02-26 CRAN (R 3.1.2)
##  proto               0.3.10   2012-12-22 CRAN (R 3.1.2)
##  qvalue              1.40.0   2014-11-23 Bioconductor  
##  registry            0.2      2012-01-24 CRAN (R 3.1.2)
##  reshape2            1.4      2014-04-23 CRAN (R 3.1.2)
##  rngtools            1.2.4    2014-03-06 CRAN (R 3.1.2)
##  rpart               4.1.8    2014-03-28 CRAN (R 3.1.2)
##  rstudioapi          0.1      2014-03-27 CRAN (R 3.1.2)
##  rtracklayer         1.26.2   2014-11-23 Bioconductor  
##  scales              0.2.4    2014-04-22 CRAN (R 3.1.2)
##  sendmailR           1.2.1    2014-09-21 CRAN (R 3.1.2)
##  stringr             0.6.2    2012-12-06 CRAN (R 3.1.2)
##  survival            2.37.7   2014-01-22 CRAN (R 3.1.2)
##  xtable              1.7.4    2014-09-12 CRAN (R 3.1.2)
##  zlibbioc            1.12.0   2014-11-23 Bioconductor

Bibliography

This vignette was generated using knitrBootstrap (Hester, 2013) with knitr (Xie, 2014) and rmarkdown (Allaire, McPherson, Xie, Wickham, et al., 2014) running behind the scenes.

Citations made with knitcitations (Boettiger, 2014).

[1] J. Allaire, J. McPherson, Y. Xie, H. Wickham, et al. rmarkdown: Dynamic Documents for R. R package version 0.3.3. 2014. URL: http://CRAN.R-project.org/package=rmarkdown.

[2] S. Arora, M. Morgan, M. Carlson and H. Pages. GenomeInfoDb: Utilities for manipulating chromosome and other 'seqname' identifiers. R package version 1.2.3. 2014.

[3] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.4. 2014. URL: http://CRAN.R-project.org/package=knitcitations.

[4] BrainSpan. “Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.” 2011. URL: http://developinghumanbrain.org.

[5] M. Carlson. TxDb.Hsapiens.UCSC.hg19.knownGene: Annotation package for TxDb object(s). R package version 3.0.0. 2014.

[6] L. Collado-Torres, A. C. Frazee, A. E. Jaffe and J. T. Leek. derfinder: Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution. https://github.com/lcolladotor/derfinder - R package version 1.0.10. 2014. URL: http://www.bioconductor.org/packages/release/bioc/html/derfinder.html.

[7] L. Collado-Torres, A. Jaffe and J. Leek. derfinderData: Processed BigWigs from BrainSpan for examples. R package version 0.99.2. 2014. URL: https://github.com/lcolladotor/derfinderData.

[8] A. Dabney and J. D. Storey. qvalue: Q-value estimation for false discovery rate control. R package version 1.40.0. 2014.

[9] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.

[10] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 0.9.0. 2013. URL: http://CRAN.R-project.org/package=knitrBootstrap.

[11] R. A. Irizarry, M. Ayree, K. D. Hansen and H. C. Hansen. bumphunter: Bump Hunter. R package version 1.6.0. 2014. URL: https://github.com/ririzarr/bumphunter.

[12] Jaffe, A. E, Murakami, Peter, et al. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies”. In: International Journal of Epidemiology (2012).

[13] F. E. H. Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell Miscellaneous. R package version 3.14-6. 2014. URL: http://CRAN.R-project.org/package=Hmisc.

[14] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.

[15] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.

[16] M. Morgan, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.0.0. 2014.

[17] M. Morgan, H. Pagès, V. Obenchain and N. Hayden. Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.18.2. 2014. URL: http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.

[18] V. Obenchain, M. Love and M. Morgan. GenomicFiles: Distributed computing by file or by range. R package version 1.2.0. 2014.

[19] H. Pages, M. Carlson, S. Falcon and N. Li. AnnotationDbi: Annotation Database Interface. R package version 1.28.1. 2014.

[20] H. Pages, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.4.0. 2014.

[21] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2014. URL: http://www.R-project.org/.

[22] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.

[23] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[24] H. Wickham and W. Chang. devtools: Tools to make developing R code easier. R package version 1.6.1. 2014. URL: http://CRAN.R-project.org/package=devtools.

[25] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.

[26] T. Yin, M. Lawrence and D. Cook. biovizBase: Basic graphic utilities for visualization of genomic data. R package version 1.14.0. 2014.