true

1 Introduction

MPRAnalyze aims to infer the transcription rate induced by each enhacer in a Massively Parallel Reporter Assay (MPRA). MPRAnalyze uses a parametric graphical model that enables direct modeling of the observed count data, without the need to use ratio-based summary statistics, and provides robust statistical methodology that addresses all major uses of MPRA.

2 Setup

An MPRA experiment is made up of two matching datasets: the RNA counts used to estimate transcription, and the DNA counts used to normalize by copy-number. MPRAnalyze assumes the input is unnormalized count data. Normalization is achieved by external factors (to correct library size) and regressing out confounding factors (by carefully designing the model).

Throughout this vignette, we’ll be using a subset of enhancers from Inoue et al., that examined a set of enhancers that were tranduced and remained episomal, and after being genomically integrated. We’ll be using a subset of the data, both in terms of number of enhancers and number of barcodes, for runtime purposes.

2.1 Formatting the data

MPRanalyze expects the input to be provided as two matrices: one for the DNA and one for the RNA counts. The formatting is fairly straightforward: each row is a single enhancer, and each column is an observation. Annotations for the columns are provided for each matrix to identify batches, barcodes and conditions. When formatting the data, note that all enhancers must have the same number of columns. In the case of missing data, padding with 0s is recommended.

library(MPRAnalyze)
data("ChrEpi")
summary(ce.colAnnot)

##  batch  condition    barcode  
##  1:20   MT:20     1      : 4  
##  2:20   WT:20     2      : 4  
##                   3      : 4  
##                   4      : 4  
##                   5      : 4  
##                   6      : 4  
##                   (Other):16

head(ce.colAnnot)

##          batch condition barcode
## MT:1:001     1        MT       1
## MT:1:002     1        MT       2
## MT:1:003     1        MT       3
## MT:1:004     1        MT       4
## MT:1:005     1        MT       5
## MT:1:006     1        MT       6

In the filtered dataset we have 110 enhancers, 40 observations each: 10 unique barcodes, 2 replicates and 2 conditions (MT stands for episomal, WT for chromosomally integrated). Note that while this datset is “paired”, and therefore the dimensionality of the two matrices and the annotations are identical, this is not always the case, and separate data frames can be used for the DNA and RNA annotations.

2.2 Creating an MpraObject object

The MpraObject is the basic class that the package works with, and is very easy to initialize once the data is properly formatted. In addition to the data itself, the user can specify certain enhancers as “controls” (usaully scrambled, random sequences included in the experiment). These will be used by MPRAnalyze to establish the null behavior. Additionally, MPRAnalyze uses parallelization to reduce runtime, with the BiocParallel package. To utilize this, the user can create a BPPARAM object and pass it to the MpraObject, it will be used throughout the analysis.

obj <- MpraObject(dnaCounts = ce.dnaCounts, rnaCounts = ce.rnaCounts, 
                  dnaAnnot = ce.colAnnot, rnaAnnot = ce.colAnnot, 
                  controls = ce.control)

## Warning in MpraObject(dnaCounts = ce.dnaCounts, rnaCounts = ce.rnaCounts, : 1
## enhancers were removed from the analysis

Note that since we are using only a subset of the data, one of the enhancers included was not detected (all observations are 0). If this is the case, MPRAnalyze removes it from the analysis. In datasets with many zeros, it is possible to add pseudo-counts. With MPRA this should be done carefully, and we recommend only adding pseudocounts in cases where the RNA counts are positive but the DNA counts are 0 (these are the only cases we know this is a false 0).

2.3 Library size normalization

MPRAnalyze allows for external factors to be used for normalization, espacially useful for library depth correction. These factors can be estimated automatically using upper quartile (default), total sum, or DESeq2-style harmonic mean normalization. If other factors are to be used, the user can provide them directly using the setDepthFactors function, and providing correction factors for the RNA and DNA counts (length of the factors must be the same as the number of columns in the respective data matrix). Note that unlike other genomic data, in which every column is a separate library, with MPRA a library is often multiple columns, since multiple barcodes can originate from a single library. For this reason, automatic estimation of library size requires the user to specify what columns belong to what library. This can be done easily by providing the names of factors from the annotations data.frame that identify library (this can be a single factor or multiple factors).

## If the library factors are different for the DNA and RNA data, separate 
## estimation of these factors is needed. We can also change the estimation 
## method (Upper quartile by default)
obj <- estimateDepthFactors(obj, lib.factor = c("batch", "condition"),
                            which.lib = "dna", 
                            depth.estimator = "uq")
obj <- estimateDepthFactors(obj, lib.factor = c("condition"),
                            which.lib = "rna", 
                            depth.estimator = "uq")

## In this case, the factors are the same - each combination of batch and 
## condition is a single library, and we'll use the default estimation
obj <- estimateDepthFactors(obj, lib.factor = c("batch", "condition"),
                            which.lib = "both")

3 Model Design

MPRAnalyze fits two nested generalized linear models. The two models have a conceptually different role: the DNA model is estimating plasmid copy numbers, and the RNA model is estimating transcription rate. Different factors should therefore be included or not included in either model, and the nested nature of the overall model requires careful thinking about the model design, and often using a different design for the DNA and the RNA model. Two common considerations are covered here:

3.1 DNA design of paired factors only

In some MPRA experiments, the DNA counts originate from pre-transduction plasmid libraries. In these cases, multiple replicates may be available, but they are independant of the multiple RNA replicates. Therefore, while there may be batch effect present in the DNA data, this effect cannot be transferred into the RNA model, and should be discarded. By not including it, the DNA estimates will essentially be averaged over replicates, but the multiple replicates will still be used to estimate the dispersion. Essentially - any factor that cannot or should not be carried from the DNA to the RNA model, should not be included in the DNA model at all.

3.2 including barcode annotations in the design

While replicate and condition factors are not always transferrable, barcode information - if available - is. Including barcode information in the DNA model allows MPRAnalyze to provide different estimated counts for each barcode, and dramatically increases the statistical power of the model. This means that barcodes should be modeled in the DNA model, but not necessarily in the RNA model. Modeling barcode effect in the RNA model essentially means different transcription rate estimates will be calculated for each different barcodes of the same enhancer, which is not usually desired. In quantification analyses, this would make comparing enhancers to eachother exceedingly complicated, since unlike batch- or condition- effects, barcodes are not comparable between enhancers. In compartive analyses, while modeling barcodes isn’t conceptually problematic, in practice this could result in overfitting the model and inlfating the results. Broadly, barcode factors should be included in the DNA model design and ignored in the RNA model design.

4 Quantification Analysis

Quantification analysis is addressing the question of what is the transription rate for each enhancer in the dataset. These estimates can then be used to identify and classify active enhancers that induce a higher transcription rate. Regarding model design - this data is from a paired experiment, so DNA factors are fully transferable to the RNA model. For the RNA, we will be interested in having a separate estimate of transcription rate for each condition (chromosomal and episomal), so this is the only factor included in the RNA model. Finally, fitting the model is done by calling the analyzeQuantification function:

obj <- analyzeQuantification(obj = obj, 
                              dnaDesign = ~ barcode + batch + condition,
                              rnaDesign = ~ condition)

## Fitting model...

## Analysis done!

We can now extract the transcription rate estimates from the model, denoted ‘alpha values’ in the MPRAnalyze model, and use the testing functionality to test for activtiy. extracting alpha values is done with the getAlpha function, that will provide separate values per-factor if a factor is provided. In this case we want a separate alpha estimate by condition:

##extract alpha values from the fitted model
alpha <- getAlpha(obj, by.factor = "condition")

##visualize the estimates
boxplot(alpha)

We can also leverage negative controls included in the data to establish a baseline rate of transcription, and use that to test for activty among the candidate enhancers. The testEmpirical function provides several statistics and p-values based on those statistics: empirical p-values, z-score based and MAD-score based p-values are currently supported. In most cases, we recommend using the MAD-score pvalues, which are median-based variant of z-scores, which makes them more robust to outliers.

## test 
res.epi <- testEmpirical(obj = obj, 
                               statistic = alpha$MT)
summary(res.epi)

##    statistic       control            zscore         mad.score      
##  Min.   : 2.947   Mode :logical   Min.   :-3.044   Min.   :-2.5859  
##  1st Qu.: 5.223   FALSE:99        1st Qu.: 0.285   1st Qu.: 0.3007  
##  Median : 5.962   TRUE :10        Median : 1.367   Median : 1.2388  
##  Mean   : 6.143                   Mean   : 1.632   Mean   : 1.4684  
##  3rd Qu.: 6.757                   3rd Qu.: 2.530   3rd Qu.: 2.2473  
##  Max.   :14.018                   Max.   :13.153   Max.   :11.4583  
##     pval.mad        pval.zscore       pval.empirical  
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.01231   1st Qu.:0.005703   1st Qu.:0.0000  
##  Median :0.10770   Median :0.085820   Median :0.1000  
##  Mean   :0.23924   Mean   :0.240589   Mean   :0.2339  
##  3rd Qu.:0.38180   3rd Qu.:0.387815   3rd Qu.:0.4000  
##  Max.   :0.99514   Max.   :0.998834   Max.   :1.0000

res.chr <- testEmpirical(obj = obj,
                               statistic = alpha$WT)
par(mfrow=c(2,2))

hist(res.epi$pval.mad, main="episomal, all")
hist(res.epi$pval.mad[ce.control], main="episomal, controls")
hist(res.chr$pval.mad, main="chromosomal, all")
hist(res.chr$pval.mad[ce.control], main="chromosomal, controls")

par(mfrow=c(1,1))

P-values seem well calibrated, getting a uniform distribution for inactive enhancers, and enrichment for low p-values with the active enhancers.

5 Comparative Analysis

MPRAnalyze also supports comparative analyses, in this case: identifying enhancers that are differentially active between conditions. While we can do this indirectly by taking the quantification results and identify enhancers that are active in one condition but not the other, a direct compartive analysis is more sensitive, and allows identification of enhancers that are more or less active, avoiding the binarization of activity. MPRAnalyze also leverages negative controls to estbalish the null differential behavior, thereby correcting any systemic bias that may be present in the data. In terms of syntax, this analysis is done very similarly to quantification, with an additional reduced model that describes the null hypothesis. In this case, the null hypothesis is no differential activtiy between conditions, so the reduced model is an empty model (intercept only)

obj <- analyzeComparative(obj = obj, 
                           dnaDesign = ~ barcode + batch + condition, 
                           rnaDesign = ~ condition, 
                           reducedDesign = ~ 1)

## Fitting controls-based background model...

## iter:2   log-likelihood:-20076.1798347277

## iter:3   log-likelihood:-19664.0415669505

## iter:4   log-likelihood:-19299.2767542567

## iter:5   log-likelihood:-19025.9445481705

## iter:6   log-likelihood:-18871.5009668801

## iter:7   log-likelihood:-18808.6831629786

## iter:8   log-likelihood:-18788.2689244527

## iter:9   log-likelihood:-18782.0537783263

## iter:10  log-likelihood:-18780.251513757

## iter:11  log-likelihood:-18779.6489965293

## iter:12  log-likelihood:-18779.4839764466

## iter:13  log-likelihood:-18779.3791121402

## iter:14  log-likelihood:-18779.3487056302

## iter:15  log-likelihood:-18779.2944249404

## iter:16  log-likelihood:-18779.3425364527

## Fitting model...

## Fitting reduced model...

## Analysis Done!

with the fitted model, we can now test for differential activity, by calling testLrt

res <- testLrt(obj)

## Performing Likelihood Ratio Test...

head(res)

##                                                                   statistic
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]  5.60981735
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784]  2.04112747
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]  0.46927997
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 10.72607875
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]  4.73566941
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]  0.07637791
##                                                                        pval
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 0.017860125
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 0.153096136
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.493318605
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 0.001056361
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 0.029543349
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.782267329
##                                                                        fdr
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013] 0.18529365
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] 0.47179202
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092] 0.79959406
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497] 0.02127219
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525] 0.24770962
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716] 0.92196683
##                                                                 df.test
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]       1
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784]       1
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]       1
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497]       1
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]       1
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]       1
##                                                                 df.dna
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]     12
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784]      8
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]      6
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497]      9
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]     10
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]      8
##                                                                 df.rna.full
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]           5
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784]           5
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]           5
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497]           5
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]           5
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]           5
##                                                                 df.rna.red
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]          4
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784]          4
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]          4
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497]          4
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]          4
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]          4
##                                                                       logFC
## A:HNF4A-ChMod_chr10:11917871-11917984_[chr10:11917842-11918013]  0.24774232
## A:HNF4A-ChMod_chr10:34165653-34165745_[chr10:34165613-34165784] -0.43318950
## A:HNF4A-ChMod_chr10:52009954-52010059_[chr10:52009921-52010092]  0.24812480
## A:HNF4A-ChMod_chr10:60767336-60767487_[chr10:60767326-60767497]  0.48901596
## A:HNF4A-ChMod_chr10:60797400-60797480_[chr10:60797354-60797525]  0.31247839
## A:HNF4A-ChMod_chr10:72112555-72112707_[chr10:72112545-72112716]  0.06622427

summary(res)

##    statistic              pval                fdr               df.test 
##  Min.   : 0.000441   Min.   :0.0000002   Min.   :0.0000172   Min.   :1  
##  1st Qu.: 0.146638   1st Qu.:0.0992155   1st Qu.:0.3862319   1st Qu.:1  
##  Median : 0.867634   Median :0.3516112   Median :0.6920815   Median :1  
##  Mean   : 2.206074   Mean   :0.4038146   Mean   :0.6291653   Mean   :1  
##  3rd Qu.: 2.718108   3rd Qu.:0.7017694   3rd Qu.:0.9219668   3rd Qu.:1  
##  Max.   :27.496768   Max.   :0.9832551   Max.   :0.9832551   Max.   :1  
##      df.dna        df.rna.full   df.rna.red     logFC         
##  Min.   : 5.000   Min.   :5    Min.   :4    Min.   :-0.49372  
##  1st Qu.: 8.000   1st Qu.:5    1st Qu.:4    1st Qu.:-0.02296  
##  Median : 9.000   Median :5    Median :4    Median : 0.11256  
##  Mean   : 9.349   Mean   :5    Mean   :4    Mean   : 0.12336  
##  3rd Qu.:11.000   3rd Qu.:5    3rd Qu.:4    3rd Qu.: 0.28139  
##  Max.   :13.000   Max.   :5    Max.   :4    Max.   : 0.83923

When the hypothesis teseting is simple (two-condition comparison), a fold-change estimate is also available:

## plot log Fold-Change
plot(density(res$logFC))

## plot volcano
plot(res$logFC, -log10(res$pval))

Analyzing MPRA data with MPRAnalyze

March 28, 2018

Contents