Contents

1 Introduction

1.1 Differential exon usage

diffUTR wraps around three methods for different exon usage analysis:

  • The DEXSeq package (see ?DEXSeqWrapper)
  • An improved version of limma’s diffSplice method (see ?diffSpliceWrapper)
  • edgeR’s diffSpliceDGE method (see ?diffSpliceDGEWrapper)

All three wrappers have been designed to use the same input (the RangedSummarizedExperiment object created by countFeatures – see below) and produce highly similar outputs, so that they can all be used with the same downstream plotting functions (Figure 1A).

Based on various benchmarks (e.g.  Sonenson et al. 2016; see also our paper), DEXSeq is the most accurate of the three methods, and should therefore be the method of choice. It is however very slow and does not scale well to larger sample sizes. We therefore suggest our improved diffSplice2 method (?diffSpliceWrapper) when DEXSeq is not an option.

A: Overview of the diffUTR workflow. B: Bin creation scheme.

Figure 1: A: Overview of the diffUTR workflow
B: Bin creation scheme.

1.2 Differential 3’ UTR usage

A chief difficulty in analyzing 3’ UTR usage is that most UTR variants are not cataloged in standard transcript annotations, limiting the utility of standard transcript-level quantification based on reference transcripts. diffUTR leverages the differential exon usage methods for differential 3’ UTR analysis using alternative poly-adenylation site databases to create additional UTR bins (Figure 1B).



2 Getting started

2.1 Package installation

Install the package with:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("diffUTR")

We load diffUTR as well as the SummarizedExperiment package on which it depends:

suppressPackageStartupMessages({
  library(diffUTR)
  library(SummarizedExperiment)
})

2.2 Obtaining gene annotations

Prior to using diffUTR you will need a gene annotation. This can either be provided as a gtf file (with relatively standard formatting), or as an ensembldb object. For example, an EnsDb object for the latest mouse annotation can be fetched as follows:

library(AnnotationHub)
ah <- AnnotationHub()
# obtain the identifier of the latest mouse ensembl annotation:
ahid <- rev(query(ah, c("EnsDb", "Mus musculus"))$ah_id)[1]
ensdb <- ah[[ahid]]

This ensdb object could then be directly passed to the prepareBins function (see below).

For the purpose of this vignette, we will use a reduced annotation containing only 100 mouse genes:

data(example_gene_annotation, package="diffUTR")
g <- example_gene_annotation
head(g)
## GRanges object with 6 ranges and 6 metadata columns:
##                      seqnames            ranges strand | gene_name     type
##                         <Rle>         <IRanges>  <Rle> |  <factor> <factor>
##   ENSMUSE00000751848        1 36307733-36307836      + |    Arid5a     exon
##   ENSMUSG00000037447        1 36307733-36324029      + |    Arid5a     gene
##   ENSMUSE00001245300        1 36307754-36307836      + |    Arid5a     exon
##   ENSMUSE00001257314        1 36307754-36307836      + |    Arid5a     exon
##   ENSMUSE00001257314        1 36307754-36307836      + |    Arid5a     exon
##   ENSMUSE00001257314        1 36307754-36307836      + |    Arid5a     exon
##                        gene_biotype           tx_biotype                 tx
##                            <factor>             <factor>        <character>
##   ENSMUSE00000751848 protein_coding protein_coding       ENSMUST00000126413
##   ENSMUSG00000037447 NA             NA                                 <NA>
##   ENSMUSE00001245300 protein_coding protein_coding       ENSMUST00000142319
##   ENSMUSE00001257314 protein_coding retained_intron      ENSMUST00000150747
##   ENSMUSE00001257314 protein_coding processed_transcript ENSMUST00000145179
##   ENSMUSE00001257314 protein_coding retained_intron      ENSMUST00000124280
##                                    gene
##                                <factor>
##   ENSMUSE00000751848 ENSMUSG00000037447
##   ENSMUSG00000037447 ENSMUSG00000037447
##   ENSMUSE00001245300 ENSMUSG00000037447
##   ENSMUSE00001257314 ENSMUSG00000037447
##   ENSMUSE00001257314 ENSMUSG00000037447
##   ENSMUSE00001257314 ENSMUSG00000037447
##   -------
##   seqinfo: 118 sequences from GRCm38 genome

3 Workflow for differential exon usage (DEU) analysis

3.1 Preparing the annotation

Because exons partially overlap, they must be disjoined in non-overlapping bins for the purpose of analysis. This is handled by the prepareBins function:

# If you know that your data will be stranded, use
# bins <- prepareBins(g, stranded=TRUE)
# Otherwise use
bins <- prepareBins(g)
## Preparing annotation
## Merging and disjoining bins

3.2 Counting reads in bins

Counting reads (or fragments) overlapping bins is done using the countFeatures function, with a vector of paths to bam files as input. Under the hood, this calls the featureCounts function of the Rsubread package, so any argument of that function can be passed to countFeatures. For example:

bamfiles <- c("path/to/sample1.bam", "path/to/sample2.bam", ...)
rse <- countFeatures(bamfiles, bins, strandSpecific=2, nthreads=6, isPairedEnd=FALSE)

Note that strandSpecific should be set correctly (i.e. to 0 if the data is unstranded, and to 1 or 2 if stranded – see ?Rsubread::featureCounts), and isPairedEnd=TRUE should be used if the data is paired-end.

For the purpose of this vignette, we load a pre-computed object containing data from Whipple et al. (2020):

data(example_bin_se, package="diffUTR")
rse <- example_bin_se
rse
## class: RangedSummarizedExperiment 
## dim: 920 6 
## metadata(0):
## assays(1): counts
## rownames(920): ENSMUSG00000000552.1 ENSMUSG00000000552.11 ...
##   ENSMUSG00000098794.6 ENSMUSG00000098794.7
## rowData names(7): gene_name type ... meanLogDensity geneAmbiguous
## colnames(6): CTRL1 CTRL2 ... LTP2 LTP3
## colData names(1): condition

The output of countFeatures is a RangedSummarizedExperiment object (see SummarizedExperiment) containing the samples’ counts (as well as normalized assays) across bins, as well as all the bin information in the rowRanges(se):

head(rowRanges(rse))
## GRanges object with 6 ranges and 7 metadata columns:
##                         seqnames              ranges strand |       gene_name
##                            <Rle>           <IRanges>  <Rle> | <CharacterList>
##    ENSMUSG00000000552.1    chr15 103340057-103340086      - |         Zfp385a
##   ENSMUSG00000000552.11    chr15 103313895-103314997      - |         Zfp385a
##    ENSMUSG00000000552.2    chr15 103340047-103340056      - |         Zfp385a
##    ENSMUSG00000000552.3    chr15 103333065-103333201      - |         Zfp385a
##    ENSMUSG00000000552.4    chr15 103320290-103320400      - |         Zfp385a
##    ENSMUSG00000000552.5    chr15 103317949-103318111      - |         Zfp385a
##                             type               gene meanLogCPM  logWidth
##                         <factor>           <factor>  <numeric> <numeric>
##    ENSMUSG00000000552.1 UTR      ENSMUSG00000000552   0.134936   3.43399
##   ENSMUSG00000000552.11 UTR/3UTR ENSMUSG00000000552   0.454942   7.00670
##    ENSMUSG00000000552.2 CDS      ENSMUSG00000000552   0.136078   2.39790
##    ENSMUSG00000000552.3 CDS      ENSMUSG00000000552   0.233634   4.92725
##    ENSMUSG00000000552.4 CDS      ENSMUSG00000000552   0.206431   4.71850
##    ENSMUSG00000000552.5 CDS      ENSMUSG00000000552   0.211493   5.09987
##                         meanLogDensity geneAmbiguous
##                              <numeric>     <logical>
##    ENSMUSG00000000552.1     0.03746455         FALSE
##   ENSMUSG00000000552.11     0.00142892         FALSE
##    ENSMUSG00000000552.2     0.10854836         FALSE
##    ENSMUSG00000000552.3     0.00918315         FALSE
##    ENSMUSG00000000552.4     0.01102348         FALSE
##    ENSMUSG00000000552.5     0.00755899         FALSE
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

In this example only a subset of information was retained, but the original file stores any information present in the annotation (e.g. transcripts, biotype, etc.)

3.3 Differential analysis

For the purpose of this example, we will use the improved diffSplice method:

rse <- diffSpliceWrapper(rse, design = ~condition)
## Testing coefficient conditionLTP
## Total number of exons:  920 
## Total number of genes:  80 
## Number of genes with 1 exon:  0 
## Mean number of exons in a gene:  12 
## Max number of exons in a gene:  45

The ~condition formula indicates that the samples should be split according to the group column of colData(rse). Alternatively to the formula inferace, a model.matrix can be directly passed. For more complex models involving several terms, you will have to specify the coefficient to test using the coef argument.

The bin-wise results of the the differential analysis have been saved in the rowData of the object:

head(rowData(rse))
## DataFrame with 6 rows and 11 columns
##                             gene_name     type               gene meanLogCPM
##                       <CharacterList> <factor>           <factor>  <numeric>
## ENSMUSG00000000552.1          Zfp385a UTR      ENSMUSG00000000552   0.134936
## ENSMUSG00000000552.11         Zfp385a UTR/3UTR ENSMUSG00000000552   0.454942
## ENSMUSG00000000552.2          Zfp385a CDS      ENSMUSG00000000552   0.136078
## ENSMUSG00000000552.3          Zfp385a CDS      ENSMUSG00000000552   0.233634
## ENSMUSG00000000552.4          Zfp385a CDS      ENSMUSG00000000552   0.206431
## ENSMUSG00000000552.5          Zfp385a CDS      ENSMUSG00000000552   0.211493
##                        logWidth meanLogDensity geneAmbiguous logDensityRatio
##                       <numeric>      <numeric>     <logical>       <numeric>
## ENSMUSG00000000552.1    3.43399     0.03746455         FALSE        2.070401
## ENSMUSG00000000552.11   7.00670     0.00142892         FALSE       -1.214023
## ENSMUSG00000000552.2    2.39790     0.10854836         FALSE        3.170190
## ENSMUSG00000000552.3    4.92725     0.00918315         FALSE        0.650240
## ENSMUSG00000000552.4    4.71850     0.01102348         FALSE        0.833778
## ENSMUSG00000000552.5    5.09987     0.00755899         FALSE        0.454890
##                       coefficient bin.p.value   bin.FDR
##                         <numeric>   <numeric> <numeric>
## ENSMUSG00000000552.1   -0.3957715    0.371207         1
## ENSMUSG00000000552.11  -0.0974467    0.701927         1
## ENSMUSG00000000552.2   -0.4178833    0.344547         1
## ENSMUSG00000000552.3    0.0279535    0.932203         1
## ENSMUSG00000000552.4    0.2551920    0.468900         1
## ENSMUSG00000000552.5    0.2423841    0.487095         1

The gene-wise aggregation has been saved in the object’s metadata:

perGene <- metadata(rse)$geneLevel
head(perGene)
## DataFrame with 6 rows and 11 columns
##                           name   nb.bins    w.coef w.abs.coef w.sqWidth
##                    <character> <numeric> <numeric>  <numeric> <integer>
## ENSMUSG00000038290        Smg6        44     1.116   1.559791        33
## ENSMUSG00000038268       Ovca2         3     1.432   1.432243        68
## ENSMUSG00000038894        Irs2         4     0.557   0.851392        57
## ENSMUSG00000033730        Egr3        12     0.344   0.842254        67
## ENSMUSG00000071076        Jund         6     0.657   1.166027        26
## ENSMUSG00000022174        Dad1        11     0.390   0.842301        64
##                    w.density sizeScore abs.sizeScore geneMeanDensity
##                    <numeric> <numeric>     <numeric>       <numeric>
## ENSMUSG00000038290    -6.174     0.043         0.049           0.011
## ENSMUSG00000038268   -11.421     0.946         0.946           0.005
## ENSMUSG00000038894    -7.889     0.194         0.283           0.021
## ENSMUSG00000033730    -8.224     0.163         0.188           0.011
## ENSMUSG00000071076    -6.169     0.113         0.230           0.017
## ENSMUSG00000022174    -6.169     0.142         0.158           0.050
##                    density.ratio     q.value
##                        <numeric>   <numeric>
## ENSMUSG00000038290      60.81827 4.73785e-39
## ENSMUSG00000038268       0.10500 9.54745e-07
## ENSMUSG00000038894       1.52049 9.93522e-07
## ENSMUSG00000033730       3.64790 2.98904e-06
## ENSMUSG00000071076       1.94223 4.50984e-06
## ENSMUSG00000022174      44.51854 5.69730e-06

The results are described in more detail below.



4 Workflow for differential 3’ UTR usage analysis

4.1 Obtaining alternative poly-adenlyation sites and preparing the bins

The preparation of the bins is done as in the standard DEU case, except that an additional argument should be given providing the alternative poly-adenylation (APA) sites. These will be used to break and extend UTRs into further bins. APA sites can be provided as a GenomicRanges object or the path to a bed file containing the coordinates. For mouse, human and C. elegans, an atlas of poly-A sites can be downloaded from https://polyasite.unibas.ch/atlas (). You can download the mouse file:

download.file("https://polyasite.unibas.ch/download/atlas/2.0/GRCm38.96/atlas.clusters.2.0.GRCm38.96.bed.gz",
              destfile="apa.mm38.bed.gz")
bins <- prepareBins(g, "apa.mm38.bed.gz")
# (Again, if you know that your data will be stranded, use `stranded=TRUE`)

Unfortunately, the polyASite database does not contain APA sites for the rat. A somewhat older similar database, PolyA_DB, does include the rat, however with an obsolete annotation; for convenience we lifted it over to Rno6, and made it available in the package data:

data(rn6_PAS)
# bins <- prepareBins(g, rn6_PAS)

If there is no APA sites database that includes your species of interest, you can still perform differential UTR analysis, however you will be limited to bins defined by the UTRs ends included in the gene annotation.

Note that if the gene annotation and the APA sites use different chromosome notation styles, you can enforce a given notation using the chrStyle argument of prepareBins. Beware however that there is no internal check that the genome builds match – be sure that they do!

For the purpose of this example, we’ll just use the same bins as in the first example.

4.2 Counting and differential analysis:

Counting reads in bins works as in the standard DEU case:

bamfiles <- c("path/to/sample1.bam", "path/to/sample2.bam", ...)
se <- countFeatures(bamfiles, bins, strandSpecific=2, nthreads=6, isPairedEnd=TRUE)

Differential analysis also works in the same way as for DEU:

rse <- diffSpliceWrapper( rse, design = ~condition )
## Testing coefficient conditionLTP
## Total number of exons:  920 
## Total number of genes:  80 
## Number of genes with 1 exon:  0 
## Mean number of exons in a gene:  12 
## Max number of exons in a gene:  45

By default, this will look for changes in any bin type. To look specifically for changes in certain types of bins, you can perform the gene-level aggregation using different filters. This can be done using the geneLevelStats() function, which accepts the arguments excludeTypes and includeTypes to decide which bin types to include in the gene-level estimates. For example:

# we can then only look for changes in CDS bins:
cds <- geneLevelStats(rse, includeTypes="CDS", returnSE=FALSE)
head(cds)
## DataFrame with 6 rows and 11 columns
##                           name   nb.bins    w.coef w.abs.coef w.sqWidth
##                    <character> <numeric> <numeric>  <numeric> <integer>
## ENSMUSG00000071076        Jund         6    -0.758   0.758416        31
## ENSMUSG00000079037        Prnp         8    -0.430   0.429916        27
## ENSMUSG00000039114        Nrn1        13     1.516   1.551778        10
## ENSMUSG00000038290        Smg6        44    -0.720   0.720179        13
## ENSMUSG00000018102   Hist1h2bc        12    -0.393   0.392852        19
## ENSMUSG00000019505         Ubb        11    -0.234   0.234007         9
##                    w.density sizeScore abs.sizeScore geneMeanDensity
##                    <numeric> <numeric>     <numeric>       <numeric>
## ENSMUSG00000071076    -5.238    -0.184         0.184           0.017
## ENSMUSG00000079037    -2.317    -0.048         0.048           0.062
## ENSMUSG00000039114    -4.982     0.086         0.088           0.094
## ENSMUSG00000038290    -5.881    -0.008         0.008           0.011
## ENSMUSG00000018102    -4.727    -0.038         0.038           0.082
## ENSMUSG00000019505     0.256    -0.008         0.008           0.333
##                    density.ratio     q.value
##                        <numeric>   <numeric>
## ENSMUSG00000071076       3.70700 4.50984e-06
## ENSMUSG00000079037      96.80200 3.13531e-05
## ENSMUSG00000039114       5.04196 1.68484e-04
## ENSMUSG00000038290      74.39720 3.82690e-04
## ENSMUSG00000018102      25.09800 2.48807e-02
## ENSMUSG00000019505     783.23642 3.33429e-02
# or only look for changes in UTR bins:
utr <- geneLevelStats(rse, includeTypes="UTR", returnSE=FALSE)
head(utr)
## DataFrame with 6 rows and 11 columns
##                           name   nb.bins    w.coef w.abs.coef w.sqWidth
##                    <character> <numeric> <numeric>  <numeric> <integer>
## ENSMUSG00000038894        Irs2         4     1.149   1.149478        59
## ENSMUSG00000022174        Dad1        11     0.360   0.888266        74
## ENSMUSG00000039114        Nrn1        13     1.246   1.246068         8
## ENSMUSG00000038290        Smg6        44    -0.887   0.886915        28
## ENSMUSG00000059991       Nptx2        12     0.439   0.786649        22
## ENSMUSG00000019505         Ubb        11    -0.230   0.230037         9
##                    w.density sizeScore abs.sizeScore geneMeanDensity
##                    <numeric> <numeric>     <numeric>       <numeric>
## ENSMUSG00000038894    -9.594     0.390         0.390           0.021
## ENSMUSG00000022174    -8.036     0.161         0.184           0.050
## ENSMUSG00000039114    -5.311     0.047         0.047           0.094
## ENSMUSG00000038290    -6.782    -0.013         0.013           0.011
## ENSMUSG00000059991    -7.579     0.078         0.102           0.031
## ENSMUSG00000019505     0.217    -0.008         0.008           0.333
##                    density.ratio     q.value
##                        <numeric>   <numeric>
## ENSMUSG00000038894       0.46500 9.93522e-07
## ENSMUSG00000022174      11.84141 5.69730e-06
## ENSMUSG00000039114       3.76888 1.58361e-03
## ENSMUSG00000038290      39.68855 7.34288e-03
## ENSMUSG00000059991       1.68266 1.01491e-02
## ENSMUSG00000019505     760.47865 3.33429e-02
# or only look for changes in bins that _could be_ UTRs:
# geneLevelStats(rse, excludeTypes=c("CDS","non-coding"), returnSE=FALSE)



5 Exploring the results

5.1 Top genes

plotTopGenes() provides gene-level statistic plots (similar to a ‘volcano plot’), which come in two variations in terms of aggregated effect size (x-axis). For standard DEU analysis, we take the weighted average of absolute bin-level coefficients, using the bins’ -log10(p-value) as weights, to produce gene-level estimates of effect sizes. For differential 3’ UTR usage, where bins are expected to have consistent directions (i.e. lengthening or shortening of the UTR) and where their size is expected to have a strong impact on biological function, the signed bin-level coefficients are weighted both by both size and significance to produce (signed) gene-level estimates of effect sizes.

plotTopGenes(rse)

By default, the size of the points reflects the relative expression of the genes, and the colour the relative expression of the significant bins with respect to the gene (density ratio). This enables the rapid identification of changes occuring in highly-expressed genes, as well as discount changes happening in bins which tend not to be included in transcripts from that gene (low density ratio). Note that it is possible to use filter by density ratio during gene-level aggregation (see ?geneLevelStats).

The product is a ggplot object, and can be manipulated as such.

By default, the gene-level statistics computed use all bin types. However, plotTopGenes also accepts directly the gene-level stats, as outputted by geneLevelStats (see above), enabling the visualization of top genes separately for CDS and UTRs, for instance:

# `utr` being the output of the above
# geneLevelStats(rse, includeTypes="UTR", returnSE=FALSE)
plotTopGenes(utr, diffUTR = TRUE)

Here we also indicated that we are in differential UTR mode, to use signed and width-weighted aggregation.

Note that when there are too many gene labels to be plotted nearby, ggrepel will omit them and produce the warning above. To plot them nevertheless, you can pass the argument max.overlaps with a higher value (see ?ggrepel::geom_text_repel).

5.2 Gene profiles

The package also offers two functions for plotting bin-level profiles for a specific gene. Before doing so, we generate normalized assays if this was not already done:

rse <- addNormalizedAssays(rse)

deuBinPlot provides plots similar to those produced by DEXSeq and limma, but offering more flexibility. They can be plotted as overall bin statistics, per condition, or per sample, and can plot various types of values. Importantly, since all data and annotation is contained in the object, these can easily be included in the plots, mapping for instance to colour or shape:

deuBinPlot(rse,"Jund")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

deuBinPlot(rse,"Jund", type="condition", colour="condition")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

deuBinPlot(rse,"Jund", type="sample", colour="condition") # shows separate samples
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.