Introduction

Gene expression datasets are complicated and have multiple sources of biological and technical variation. These datasets have recently become more complex as it is now feasible to assay gene expression from the same individual in multiple tissues or at multiple time points. The variancePartition package implements a statistical method to quantify the contribution of multiple sources of variation and decouple within/between-individual variation. In addition, variancePartition produces results at the gene-level to identity genes that follow or deviate from the genome-wide trend.

The variancePartition package provides a general framework for understanding drivers of variation in gene expression in experiments with complex designs. A typical application would consider a dataset of gene expression from individuals sampled in multiple tissues or multiple time points where the goal is to understand variation within versus between individuals and tissues. variancePartition use a linear mixed model to partition the variance attributable to multiple variables in the data. The analysis is built on top of the lme4 package (Bates et al. 2015), and some basic knowledge about linear mixed models will give you some intuition about the behavior of variancePartition (Pinheiro and Bates 2000; Galecki and Burzykowski 2013).

Input data

There are three components to an analysis:

  1. Gene expression data: In general, this is a matrix of normalized gene expression values with genes as rows and experiments as columns.
  • Count-based quantification: featureCounts (Liao, Smyth, and Shi 2014), HTSeq (Anders, Pyl, and Huber 2015)

    Counts mapping to each gene can be normalized using counts per million (CPM), reads per kilobase per million (RPKM) or fragments per kilobase per million (FPKM). These count results can be processed with limma::voom() (Law et al. 2014) to model the precision of each observation or DESeq2 (Love, Huber, and Anders 2014).

  • Isoform quantification: kallisto (Bray et al. 2016), sailfish (Patro, Mount, and Kingsford 2014), salmon (Patro, Duggal, and Kingsford 2015), RSEM (Li and Dewey 2011), cufflinks (Trapnell et al. 2010)

    These perform isoform-level quantification using reads that map to multiple transcripts. Quantification values can be read directly into R, or processed with ballgown (Frazee et al. 2015) or tximport (Soneson, Love, and Robinson 2015).

  • Microarray data: any standard normalization such as rma in the oligo (Carvalho and Irizarry 2010) package can be used.

  • Any set of features: chromatin accessibility, protein quantification, etc

  1. Metadata about each experiment:

    A data.frame with information about each experiment such as patient ID, tissue, sex, disease state, time point, batch, etc.

  2. Formula indicating which metadata variables to consider:

    An R formula such as ~ Age + (1|Individual) + (1|Tissue) + (1|Batch) indicating which metadata variables should be used in the analysis.

Variance partitioning analysis will assess the contribution of each metadata variable to variation in gene expression and can report the intra-class correlation for each variable.

Running an analysis

A typical analysis with variancePartition is only a few lines of R code, assuming the expression data has already been normalized. Normalization is a separate topic addressed briefly in [Applying variancePartition to RNA-seq expression data].

The simulated dataset included as an example contains measurements of 200 genes from 100 samples. These samples include assays from 3 tissues across 25 individuals processed in 4 batches. The individuals range in age from 36 to 73. A typical variancePartition analysis will assess the contribution of each aspect of the study design (i.e. individual, tissue, batch, age) to the expression variation of each gene. The analysis will prioritize these axes of variation based on a genome-wide summary and give results at the gene-level to identity genes that follow or deviate from this genome-wide trend. The results can be visualized using custom plots and can be used for downstream analysis.

Standard application

# load library
library("variancePartition")

# load simulated data:
# geneExpr: matrix of gene expression values
# info: information/metadata about each sample
data(varPartData)

# Specify variables to consider
# Age is continuous so model it as a fixed effect
# Individual and Tissue are both categorical,
# so model them as random effects
# Note the syntax used to specify random effects
form <- ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)

# Fit model and extract results
# 1) fit linear mixed model on gene expression
# If categorical variables are specified,
#     a linear mixed model is used
# If all variables are modeled as fixed effects,
#       a linear model is used
# each entry in results is a regression model fit on a single gene
# 2) extract variance fractions from each model fit
# for each gene, returns fraction of variation attributable
#       to each variable
# Interpretation: the variance explained by each variables
# after correcting for all other variables
# Note that geneExpr can either be a matrix,
# and EList output by voom() in the limma package,
# or an ExpressionSet
varPart <- fitExtractVarPartModel(geneExpr, form, info)

# sort variables (i.e. columns) by median fraction
#       of variance explained
vp <- sortCols(varPart)

# Figure 1a
# Bar plot of variance fractions for the first 10 genes
plotPercentBars(vp[1:10, ])

# Figure 1b
# violin plot of contribution of each variable to total variance
plotVarPart(vp)

variancePartition includes a number of custom plots to visualize the results. Since variancePartition attributes the fraction of total variation attributable to each aspect of the study design, these fractions naturally sum to 1. plots the partitioning results for a subset of genes (Figure 1a), and shows a genome-wide violin plot of the distribution of variance explained by each variable across all genes (Figure 1b). (Note that these plots show results in terms of of variance explained, while the results are stored in terms of the fraction.)

The core functions of variancePartition work seemlessly with gene expression data stored as a matrix, data.frame, EList from limma or ExpressionSet from Biobase. fitExtractVarPartModel() returns an object that stores the variance fractions for each gene and each variable in the formula specified. These fractions can be accessed just like a data.frame:

# Access first entries
head(varPart)
##          Batch Individual Tissue      Age Residuals
## gene1 0.000158      0.890 0.0247 4.53e-05    0.0847
## gene2 0.000000      0.806 0.1010 3.34e-04    0.0926
## gene3 0.002422      0.890 0.0356 1.47e-03    0.0704
## gene4 0.000000      0.769 0.1253 1.01e-03    0.1048
## gene5 0.000000      0.700 0.2091 3.87e-05    0.0911
## gene6 0.002344      0.722 0.1679 2.72e-03    0.1048
# Access first entries for Individual
head(varPart$Individual)
## [1] 0.890 0.806 0.890 0.769 0.700 0.722
# sort genes based on variance explained by Individual
head(varPart[order(varPart$Individual, decreasing = TRUE), ])
##           Batch Individual  Tissue      Age Residuals
## gene43  0.00000      0.914 0.01174 3.78e-04    0.0735
## gene174 0.00000      0.911 0.00973 2.02e-03    0.0770
## gene111 0.00000      0.907 0.00839 9.74e-04    0.0839
## gene127 0.00000      0.904 0.01384 5.08e-04    0.0821
## gene151 0.00608      0.903 0.00000 1.35e-05    0.0910
## gene91  0.00000      0.900 0.01414 1.11e-06    0.0856

Saving plot to file

In order to save the plot to a file, use the ggsave() function:

fig <- plotVarPart(vp)
ggsave(file, fig)

Plot expression stratified by other variables

variancePartition also includes plotting functions to visualize the variation across a variable of interest. plots the expression of a gene stratified by the specified variable. In the example dataset, users can plot a gene expression trait stratified by Tissue (Figure 2a) or Individual (Figure 2b).

# get gene with the highest variation across Tissues
# create data.frame with expression of gene i and Tissue
#       type for each sample
i <- which.max(varPart$Tissue)
GE <- data.frame(Expression = geneExpr[i, ], Tissue = info$Tissue)

# Figure 2a
# plot expression stratified by Tissue
plotStratify(Expression ~ Tissue, GE, main = rownames(geneExpr)[i])

# get gene with the highest variation across Individuals
# create data.frame with expression of gene i and Tissue
#       type for each sample
i <- which.max(varPart$Individual)
GE <- data.frame(
  Expression = geneExpr[i, ],
  Individual = info$Individual
)

# Figure 2b
# plot expression stratified by Tissue
label <- paste("Individual:", format(varPart$Individual[i] * 100,
  digits = 3
), "%")
main <- rownames(geneExpr)[i]
plotStratify(Expression ~ Individual, GE,
  colorBy = NULL,
  text = label, main = main
)

For gene141, variation across tissues explains 52.9% of variance in gene expression. For gene43, variation across Individuals explains 91.4% of variance in gene expression.

Intuition about the backend

At the heart of variancePartition, a regression model is fit for each gene separately and summary statistics are extracted and reported to the user for visualization and downstream analysis. For a single model fit, calcVarPart() computes the fraction of variance explained by each variable. calcVarPart() is defined by this package, and computes these statistics from either a fixed effects model fit with lm() or a linear mixed model fit with lme4::lmer(). fitExtractVarPartModel() loops over each gene, fits the regression model and returns the variance fractions reported by calcVarPart().

Fitting the regression model and extracting variance statistics can also be done directly:

library("lme4")

# fit regression model for the first gene
form_test <- geneExpr[1, ] ~ Age + (1 | Individual) + (1 | Tissue)
fit <- lmer(form_test, info, REML = FALSE)

# extract variance statistics
calcVarPart(fit)
## Individual     Tissue        Age  Residuals 
##   8.90e-01   2.47e-02   4.35e-05   8.50e-02

Interpretation

variancePartition fits a linear (mixed) model that jointly considers the contribution of all specified variables on the expression of each gene. It uses a multiple regression model so that the effect of each variable is assessed while jointly accounting for all others. Standard ANOVA implemented in R involves refitting the model while dropping terms, but is aimed at hypothesis testing. calcVarPart() is aimed at estimating variance fractions. It uses a single fit of the linear (mixed) model and evaluates the sum of squares of each term and the sum of squares of the total model fit. However, we note that like any multiple regression model, high correlation bewtween fixed or random effect variables (see Assess correlation between all pairs of variables) can produce unstable estimates and it can be challanging to identify which variable is responsible for the expression variation.

The results of variancePartition give insight into the expression data at multiple levels. Moreover, a single statistic often has multiple equivalent interpretations while only one is relevant to the biological question. Analysis of the example data in Figure 1 gives some strong interpretations.

Considering the median across all genes,

  1. variation across individuals explains a median of 81.5% of the variation in expression, after correcting for tissue, batch and age
  2. variation across tissues explains a median of 8.2% of the variation in expression, after correcting for other the variables
  3. variation across batches is negligible after correcting for variation due to other variables
  4. the effect of age is negligible after correcting for other variables
  5. correcting for individual, tissue, batch and age leaves a median of 9.3% of the total variance in expression.

These statistics also have a natural interpretation in terms of the intra-class correlation (ICC), the correlation between observations made from samples in the same group.

Considering the median across across all genes and all experiments,

  1. the ICC for individual is 81.5%.
  2. the ICC for tissue is 8.2%.
  3. two randomly selected gene measurements from same individual, but regardless of tissue, batch or age, have a correlation of 81.5%.
  4. two randomly selected gene measurements from same tissue, but regardless of individual, batch or age, have a correlation of 8.2%.
  5. two randomly selected gene measurements from the same individual and same tissue, but regardless of batch and age, have an correlation of 81.5% + 8.2% = 89.7%.

Note that that the ICC here is interpreted as the ICC after correcting for all other variables in the model.

These conclusions are based on the genome-wide median across all genes, but the same type of statements can be made at the gene-level. Moreover, care must be taken in the interpretation of nested variables. For example, Age is nested within Individual since the multiple samples from each individual are taken at the same age. Thus the effect of Age removes some variation from being explained by Individual. This often arises when considering variation across individuals and across sexes: any cross-sex variation is a component of the cross-individual variation. So the total variation across individuals is the sum of the fraction of variance explained by Sex and Individual. This nesting/summing of effects is common for variables that are properties of the individual rather than the sample. For example, sex and ethnicity are always properties of the individual. Variables like age and disease state can be properties of the individual, but could also vary in time-course or longitudinal experiments. The the interpretation depends on the experimental design.

The real power of variancePartition is to identify specific genes that follow or deviate from the genome-wide trend. The gene-level statistics can be used to identify a subset of genes that are enriched for specific biological functions. For example, we can ask if the 500 genes with the highest variation in expression across tissues (i.e. the long tail for tissue in Figure 1a) are enriched for genes known to have high tissue-specificity.

Should a variable be modeled as fixed or random effect?

Categorical variables should (almost) always be modeled as a random effect. The difference between modeling a categorical variable as a fixed versus random effect is minimal when the sample size is large compared to the number of categories (i.e. levels). So variables like disease status, sex or time point will not be sensitive to modeling as a fixed versus random effect. However, variables with many categories like Individual must be modeled as a random effect in order to obtain statistically valid results. So to be on the safe side, categorical variable should be modeled as a random effect.

% R and variancePartition handle catagorical variables stored as a very naturally. If categorical variables are stored as an or , they must be converted to a before being used with variancePartition

variancePartition fits two types of models:

  1. linear mixed model where all categorical variables are modeled as random effects and all continuous variables are fixed effects. The function lme4::lmer() is used to fit this model.

  2. fixed effected model, where all variables are modeled as fixed effects. The function lm() is used to fit this model.

Which variables should be included?

In my experience, it is useful to include all variables in the first analysis and then drop variables that have minimal effect. However, like all multiple regression methods, variancePartition will divide the contribution over multiple variables that are strongly correlated. So, for example, including both sex and height in the model will show sex having a smaller contribution to variation gene expression than if height were omitted, since there variables are strongly correlated. This is a simple example, but should give some intuition about a common issue that arises in analyses with variancePartition.

variancePartition can naturally assess the contribution of both individual and sex in a dataset. As expected, genes for which sex explains a large fraction of variation are located on chrX and chrY. If the goal is to interpret the impact of sex, then there is no issue. But recall the issue with correlated variables and note that individual is correlated with sex, because each individual is only one sex regardless of how many samples are taken from a individual. It follows that including sex in the model reduces the apparent contribution of individual to gene expression. In other words, the ICC for individual will be different if sex is included in the model.

In general, including variables in the model that do not vary within individual will reduce the apparent contribution of individual as estimated by variancePartition. For example, sex and ethnicity never vary between multiple samples from the same individual and will always reduce the apparent contribution of individual. However, disease state and age may or may not vary depending on the study design.

In biological datasets technical variability (i.e. batch effects) can often reduce the apparent biological signal. In RNA-seq analysis, it is common for the the impact of this technical variability to be removed before downstream analysis. Instead of including these batch variable in the variancePartition analysis, it is simple to complete the expression residuals with the batch effects removed and then feeds these residuals to variancePartition. This will increase the fraction of variation explained by biological variables since technical variability is reduced.

Assess correlation between all pairs of variables

Evaluating the correlation between variables in a important part in interpreting variancePartition results. When comparing two continuous variables, Pearson correlation is widely used. But variancePartition includes categorical variables in the model as well. In order to accommodate the correlation between a continuous and a categorical variable, or two categorical variables we used canonical correlation analysis.

Canonical Correlation Analysis (CCA) is similar to correlation between two vectors, except that CCA can accommodate matricies as well. For a pair of variables, canCorPairs() assesses the degree to which they co-vary and contain the same information. Variables in the formula can be a continuous variable or a discrete variable expanded to a matrix (which is done in the backend of a regression model). For a pair of variables, canCorPairs() uses CCA to compute the correlation between these variables and returns the pairwise correlation matrix.

Statistically, let rho be the array of correlation values returned by the standard R function cancor to compute CCA. canCorPairs() returns rho / sum(rho) which is the fraction of the maximum possible correlation. Note that CCA returns correlations values between 0 and 1

form <- ~ Individual + Tissue + Batch + Age + Height

# Compute Canonical Correlation Analysis (CCA)
# between all pairs of variables
# returns absolute correlation value
C <- canCorPairs(form, info)

# Plot correlation matrix
# between all pairs of variables
plotCorrMatrix(C)

Advanced analysis

Extracting additional information from model fits

Advanced users may want to perform the model fit and extract results in separate steps in order to examine the fit of the model for each gene. Thus the work of can be divided into two steps: 1) fit the regression model, and 2) extracting variance statistics.

form <- ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)

# Fit model
results <- fitVarPartModel(geneExpr, form, info)

# Extract results
varPart <- extractVarPart(results)

Note that storing the model fits can use a lot of memory (~10Gb with 20K genes and 1000 experiments). I do not recommend unless you have a specific need for storing the entire model fit.

Instead, fitVarPartModel() can extract any desired information using any function that accepts the model fit from lm() or lmer(). The results are stored in a list and can be used for downstream analysis.

# Fit model and run summary() function on each model fit
vpSummaries <- fitVarPartModel(geneExpr, form, info, fxn = summary)
# Show results of summary() for the first gene
vpSummaries[[1]]
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: y.local ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)
##    Data: data
## Weights: data$w.local
## Control: control
## 
##      AIC      BIC   logLik deviance df.resid 
##      397      413     -193      385       94 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0580 -0.5865  0.0147  0.6603  1.9708 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  Individual (Intercept) 10.82275 3.2898  
##  Batch      (Intercept)  0.00192 0.0438  
##  Tissue     (Intercept)  0.30010 0.5478  
##  Residual                1.02997 1.0149  
## Number of obs: 100, groups:  Individual, 25; Batch, 4; Tissue, 3
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) -10.60243    1.09493   -9.68
## Age           0.00318    0.01610    0.20
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.739

Removing batch effects before fitting model

Gene expression studies often have substantial batch effects, and variancePartition can be used to understand the magnitude of the effects. However, we often want to focus on biological variables (i.e. individual, tissue, disease, sex) after removing the effect of technical variables. Depending on the size of the batch effect, I have found it useful to correct for the batch effect first and then perform a variancePartition analysis afterward. Subtracting this batch effect can reduce the total variation in the data, so that the contribution of other variables become clearer.

Standard analysis:

form <- ~ (1 | Tissue) + (1 | Individual) + (1 | Batch) + Age
varPart <- fitExtractVarPartModel(geneExpr, form, info)

Analysis on residuals:

library("limma")
# subtract out effect of Batch
fit <- lmFit(geneExpr, model.matrix(~Batch, info))
res <- residuals(fit, geneExpr)

# fit model on residuals
form <- ~ (1 | Tissue) + (1 | Individual) + Age

varPartResid <- fitExtractVarPartModel(res, form, info)

Remove batch effect with linear mixed model

# subtract out effect of Batch with linear mixed model
modelFit <- fitVarPartModel(geneExpr, ~ (1 | Batch), info)
res <- residuals(modelFit)

# fit model on residuals
form <- ~ (1 | Tissue) + (1 | Individual) + Age

varPartResid <- fitExtractVarPartModel(res, form, info)

If the two-step process requires too much memory, the residuals can be computed more efficiently. Here, run the function inside the call to fitVarPartModel() to avoid storing the large intermediate results.

# extract residuals directly without storing intermediate results
residList <- fitVarPartModel(geneExpr, ~ (1 | Batch), info,
  fxn = residuals
)

# convert list to matrix
residMatrix <- do.call(rbind, residList)

Variation within multiple subsets of the data

So far, we have focused on interpreting one variable at a time. But the linear mixed model behind variancePartition is a very powerful framework for analyzing variation at multiple levels. We can easily extend the previous analysis of the contribution of individual and tissue on variation in gene expression to examine the contribution of individual within each tissue. This analysis is as easy as specifying a new formula and rerunning variancePartition. Note that is analysis will only work when there are replicates for at least some individuals within each tissue in order to assess cross-individual variance with in a tissue.

# specify formula to model within/between individual variance
# separately for each tissue
# Note that including +0 ensures each tissue is modeled explicitly
# Otherwise, the first tissue would be used as baseline
form <- ~ (Tissue + 0 | Individual) + Age + (1 | Tissue) + (1 | Batch)

# fit model and extract variance percents
varPart <- fitExtractVarPartModel(geneExpr, form, info, showWarnings = FALSE)

# violin plot
plotVarPart(sortCols(varPart), label.angle = 60)

This analysis corresponds to a varying coefficient model, where the correlation between individuals varies for each tissue ]. Since the variation across individuals is modeled within each tissue, the total variation explained does not sum to 1 as it does for standard application of variancePartition. So interpretation as intra-class does not strictly apply and use of plotPercentBars() is no longer applicable. Yet the variables in the study design are still ranked in terms of their genome-wide contribution to expression variation, and results can still be analyzed at the gene level. See Variation with multiple subsets of the data for statistical details.

Detecting problems caused by collinearity of variables

Including variables that are highly correlated can produce misleading results and overestimate the contribution of variables modeled as fixed effects. This is usually not an issue, but can arise when statistically redundant variables are included in the model. In this case, the model is "degenerate" or "computationally singular" and parameter estimates from this model are not meaningful. Dropping one or more of the covariates will fix this problem.

A check of collinearity is built into fitVarPartModel() and fitExtractVarPartModel(), so the user will be warned if this is an issue.

Alternatively, the user can use the colinearityScore() function to evaluate whether this is an issue for a single model fit:

form <- ~ (1 | Individual) + (1 | Tissue) + Age + Height

# fit model
res <- fitVarPartModel(geneExpr[1:4, ], form, info)
# evaluate the collinearity score on the first model fit
# this reports the correlation matrix between coefficient estimates
# for fixed effects
# the collinearity score is the maximum absolute correlation value
# If the collinearity score > .99 then the variance partition
# estimates may be problematic
# In that case, a least one variable should be omitted
colinearityScore(res[[1]])
## [1] 0.777
## attr(,"vcor")
##             (Intercept)     Age  Height
## (Intercept)       1.000 -0.4191 -0.7774
## Age              -0.419  1.0000 -0.0575
## Height           -0.777 -0.0575  1.0000

Including weights computed separately

variancePartition automatically used precision weights computed by limma::voom(), but the user can also specify custom weights using the weightsMatrix argument.

form <- ~ (1 | Individual) + (1 | Tissue) + Age + Height

# Specify custom weights
# In this example the weights are simulated from a
# uniform distribution and are not meaningful.
weights <- matrix(runif(length(geneExpr)), nrow = nrow(geneExpr))

# Specify custom weights
res <- fitExtractVarPartModel(geneExpr[1:4, ], form, info,
  weightsMatrix = weights[1:4, ]
)

In addition, setting the useWeights=FALSE will suppress usage of the weights in all cases, i.e. when the weights are specified manually or implicitly with the results of limma::voom().

Including interaction terms

Typical analysis assumes that the effect of each variable on gene expression does not depend on other variables in the model. Sometimes this assumption is too strict, and we want to model an interaction effect whereby the effect of Batch depends on Tissue. This can be done easly by specifying an interaction term, (1|Batch:Tissue). Since Batch has 4 categories and Tissue has 3, this interaction term implicity models a new 3*4 = 12 category variable in the analysis. This new interaction term will absorb some of the variance from the Batch and Tissue term, so an interaction model should always include the two constituent variables.

Here we fit an interaction model, but we observe that interaction between Batch and Tissue does not explain much expression variation.

form <- ~ (1 | Individual) + Age + Height + (1 | Tissue) + (1 | Batch) +
  (1 | Batch:Tissue)

# fit model
vpInteraction <- fitExtractVarPartModel(geneExpr, form, info)

plotVarPart(sortCols(vpInteraction))