# 1 Introduction

This document briefly how to use CAGEr CAGEr, a Bioconductor package designed to process, analyse and visualise Cap Analysis of Gene Expression (CAGE)sequencing data. CAGE (Kodzius et al. 2006) is a high-throughput method for transcriptome analysis that utilizes cap trapping (Carninci et al. 1996), a technique based on the biotinylation of the 7-methylguanosine cap of Pol II transcripts, to pulldown the 5′-complete cDNAs reversely transcribed from the captured transcripts. A linker sequence is ligated to the 5′ end of the cDNA and a specific restriction enzyme is used to cleave off a short fragment from the 5′ end. Resulting fragments are then amplified and sequenced using massive parallel high-throughput sequencing technology, which results in a large number of short sequenced tags that can be mapped back to the referent genome to infer the exact position of the transcription start sites (TSSs) used for transcription of captured RNAs (Figure 1). The number of CAGE tags supporting each TSS gives the information on the relative frequency of its usage and can be used as a measure of expression from that specific TSS. Thus, CAGE provides information on two aspects of capped transcriptome: genome-wide 1bp-resolution map of TSSs and transcript expression levels. This information can be used for various analyses, from 5′ centered expression profiling (Takahashi et al. 2012) to studying promoter architecture (Carninci et al. 2006).

CAGE samples derived from various organisms (genomes) can be analysed by CAGEr and the only limitation is the availability of the referent genome as a BSgenome package in case when raw mapped CAGE tags are processed. CAGEr provides a comprehensive workflow that starts from mapped CAGE tags and includes reconstruction of TSSs and promoters and their visualisation, as well as more specialized downstream analyses like promoter width, expression profiling and differential TSS usage. It can use both Binary Sequence Alignment Map (BAM) files of aligned CAGE tags or files with genomic locations of TSSs and number of supporting CAGE tags as input. If BAM files are provided CAGEr constructs TSSs from aligned CAGE tags and counts the number of tags supporting each TSS, while allowing filtering out low-quality tags and removing technology-specific bias. It further performs normalization of raw CAGE tag count, clustering of TSSs into tag clusters (TC) and their aggregation across multiple CAGE experiments into promoters to construct the promoterome. Various methods for normalization and clustering of TSSs are supported. Exporting data into different types of track files allows various visualisations of TSSs and clusters (promoters) in the UCSC Genome Browser, which facilitate generation of hypotheses. CAGEr manipulates multiple CAGE experiments at once and performs analyses across datasets, including expression profiling and detection of differential TSS usage (promoter shifting). Multicore option for parallel processing is supported on Unix-like platforms, which significantly reduces computing time.

Here are some of the functionalities provided in this package:

• Reading in multiple CAGE datasets from various sources; user provided BAM or TSS input files, public CAGE datasets from accompanying data package.

• Correcting systematic G nucleotide addition bias at the 5′ end of CAGE tags.

• Plotting pairwise scatter plots, calculating correlation between datasets and merging datasets.

• Normalizing raw CAGE tag count: simple tag per million (tpm) or power-law based normalization (Balwierz et al. 2009).

• Clustering individual TSSs into tag clusters (TCs) and aggregating clusters across multiple CAGE datasets to create a set of consensus promoters.

• Making bedGraph or BED files of individual TSSs or clusters for visualisation in the genome browser.

• Expression clustering of individual TSSs or consensus promoters into distinct expression profiles using common clustering algorithms.

• Calculating promoter width based on the cumulative distribution of CAGE signal along the promoter.

• Scoring and statistically testing differential TSS usage (promoter shifting) and detecting promoters that shift between two samples.

Several data packages are accompanying CAGEr package. They contain majority of the up-to-date publicly available CAGE data produced by major consortia including FANTOM and ENCODE. These include FANTOM3and4CAGE package available from Bioconductor, as well as ENCODEprojectCAGE and ZebrafishDevelopmentalCAGE packages available from http://promshift.genereg.net/CAGEr/. In addition, direct fetching of TSS data from FANTOM5 web resource (the largest collection of TSS data for human and mouse) from within CAGEr is also available. These are all valuable resources of genome-wide TSSs in various tissue/cell types for various model organisms that can be used directly in R. Section 5 in this vignette describes how these public datasets can be included into a workflow provided by CAGEr. For further information on the content of the data packages and the list of available CAGE datasets please refer to the vignette of the corresponding data package.

For further details on the implemented methods and for citing the CAGEr package in your work please refer to (Haberle et al. 2015).

# 2 Input data for CAGEr

CAGEr package supports three types of CAGE data input:

• Sequenced CAGE tags mapped to the genome: either BAM (Binary Sequence Alignment Map) files of sequenced CAGE tags aligned to the referent genome (including the paired-end data such as CAGEscan) or BED files of CAGE tags (fragments).

• CAGE detected TSSs (CTSSs): tab separated files with genomic coordinates of TSSs and number of tags supporting each TSS. The file should not contain a header and the data must be organized in four columns:

• name of the chromosome: names must match the names of chromosomes in the corresponding BSgenome package.
• 1-based coordinate of the TSS on the chromosome
• genomic strand: should be either + or -
• number of CAGE tags supporting that TSS
• Publicly available CAGE datasets from R data package: Several data packages containing CAGE data for various organisms produced by major consortia are accompanying this package. Selected subset of these data can be used as input for .

The type and the format of the input files is specified at the beginning of the workflow, when the CAGEset object is created (section 4.2). This is done by setting the inputFilesType argument, which accepts the following self-explanatory options referring to formats mentioned above: "bam", "bamPairedEnd", "bed", "ctss", "CTSStable".

In addition, the package provides a method for coercing a data.frame object containing single base-pair TSS information into a CAGEset object (as described in section 4.2.1), which can be further used in the workflow described below.

# 3 The CAGEr workflow

## 3.1 Getting started

We start the workflow by creating a CAGEexp object, which is a container for storing CAGE datasets and all the results that will be generated by applying specific functions. The CAGEexp objects are an extension of the MultiAssayExperiment class, and therefore can use all their methods. The expression data is stored in CAGEexp using SummarizedExperiment objects, and can also access their methods.

To load the CAGEr package and the other libraries into your R envirnoment type:

library(CAGEr)

## 3.2 Creating a CAGEexp object

### 3.2.1 Specifying a genome assembly

In this tutorial we will be using data from zebrafish Danio rerio that was mapped to the danRer7 assembly of the genome. Therefore, the corresponding genome package BSgenome.Drerio.UCSC.danRer7 has to be installed. It will be automatically loaded by CAGEr commands when needed.

In case the data is mapped to a genome that is not readily available through BSgenome package (not in the list returned by BSgenome::available.genomes() function), a custom BSgenome package has to be build and installed first. (See the vignette within the BSgenome package for instructions on how to build a custom genome package). The genomeName argument can then be set to the name of the build genome package when creating a CAGEexp object (see the section Creating CAGEexp object below).

The BSgenome package is used to access information about the chromosomes (sequence, length, circularity). By default, CAGEr will discard alignments that are not on chromosomes named in the BSgenome package. The genomeName argument can be set to NULL in order to suppress this behaviour, or as a last resort when no BSgenome package is available. However, CAGEr functions that need access to the genome sequence, for instance for G-correction will not work in that case.

### 3.2.2 Specifying input files

The subset of zebrafish (Danio rerio) developmental time-series CAGE data generated by (Nepal et al. 2013) will be used in the following demonstration of the CAGEr workflow.

Files with genomic coordinates of TSSs detected by CAGE in 4 zebrafish developmental stages are included in this package in the extdata subdirectory. The files contain TSSs from a part of chromosome 17 (26,000,000-46,000,000), and there are two files for one of the developmental stages (two independent replicas). The data in files is organized in four tab-separated columns as described above in section 2.

inputFiles = list.files( system.file("extdata", package = "CAGEr")
, "ctss$" , full.names = TRUE) ### 3.2.3 Creating the object The CAGEexp object is crated with the CAGEexp contstructor, that requries information on file path an type, sample names and reference genome name. ce <- CAGEexp( genomeName = "BSgenome.Drerio.UCSC.danRer7" , inputFiles = inputFiles , inputFilesType = "ctss" , sampleLabels = sub( ".chr17.ctss", "", basename(inputFiles)) ) To display the created object type: ce ## A CAGEexp object of 0 listed ## experiments with no user-defined names and respective classes. ## Containing an ExperimentList class object of length 0: ## Functionality: ## experiments() - obtain the ExperimentList instance ## colData() - the primary/phenotype DataFrame ## sampleMap() - the sample coordination DataFrame ## $, [, [[ - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save all data to files

The supplied information can be seen in the Input data information section, whereas all other slots are still empty, since no data has been read yet and no analysis conducted.

## 3.3 Reading in the data

In case when the CAGE / TSS data is to be read from input files, an empty CAGEexp object with information about the files is first created as described above in section 3.2. To actually read in the data into the object we use getCTSS() function, that will add an experiment called tagCountMatrix to the CAGEexp object.

getCTSS(ce)
ce
## A CAGEexp object of 1 listed
##  experiment with a user-defined name and respective class.
##  Containing an ExperimentList class object of length 1:
##  [1] tagCountMatrix: RangedSummarizedExperiment with 23343 rows and 5 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  \$, [, [[ - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save all data to files

This function reads the provided files in the order they were specified in the inputFiles argument. It creates a single set of all TSSs detected across all input datasets (union of TSSs) and a table with counts of CAGE tags supporting each TSS in every dataset. (Note that in case when a CAGEr object is created by coercion from an existing expression table there is no need to call getCTSS()).

Genomic coordinates of all TSSs and numbers of supporting CAGE tags in every input sample can be retrieved using the CTSStagCountSE() function. CTSScoordinatesGR() accesses the CTSS coordinates and CTSStagCountDF() accesses the CTSS expression values.1 Data can also be accessed directly using the native methods of the MultiAssayExperiment and SummarizedExperiment classes, for example ce[["tagCountMatrix"]], rowRanges(ce[["tagCountMatrix"]]) and assay(ce[["tagCountMatrix"]]).

CTSStagCountSE(ce)
## class: RangedSummarizedExperiment
## dim: 23343 5
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(5): Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
##   Zf.unfertilized.egg
## colData names(0):
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 0 metadata columns:
##           seqnames       pos strand
##              <Rle> <integer>  <Rle>
##       [1]    chr17  26027430      +
##       [2]    chr17  26050540      +
##       [3]    chr17  26118088      +
##       [4]    chr17  26142853      +
##       [5]    chr17  26166954      +
##       ...      ...       ...    ...
##   [23339]    chr17  45975041      -
##   [23340]    chr17  45975540      -
##   [23341]    chr17  45975544      -
##   [23342]    chr17  45982697      -
##   [23343]    chr17  45999921      -
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7
CTSStagCountDF(ce)
## DataFrame with 23343 rows and 5 columns
##       Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
##             <Rle>   <Rle>         <Rle>         <Rle>               <Rle>
## 1               0       0             1             0                   0
## 2               0       0             0             0                   1
## 3               0       0             1             0                   0
## 4               0       0             0             1                   0
## 5               0       0             1             0                   0
## ...           ...     ...           ...           ...                 ...
## 23339           1       0             0             0                   0
## 23340           0       2             0             0                   0
## 23341           0       1             0             0                   0
## 23342           0       0             1             0                   0
## 23343           1       0             0             0                   0

For compatiblity with earlier CAGEr works using CAGEset objects, and to provide simpler data formats, the coordinates and expression values can also be accessed as simple data.frames. Note howerver that with large data sets it can cause extreme performance issues.

head(CTSScoordinates(ce))
##     chr      pos strand
## 1 chr17 26027430      +
## 2 chr17 26050540      +
## 3 chr17 26118088      +
## 4 chr17 26142853      +
## 5 chr17 26166954      +
## 6 chr17 26222417      +
head(CTSStagCountDf(ce))
##   Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
## 1           0       0             1             0                   0
## 2           0       0             0             0                   1
## 3           0       0             1             0                   0
## 4           0       0             0             1                   0
## 5           0       0             1             0                   0
## 6           1       1             0             0                   0
head(CTSStagCount(ce))
##     chr      pos strand Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
## 1 chr17 26027430      +           0       0             1             0
## 2 chr17 26050540      +           0       0             0             0
## 3 chr17 26118088      +           0       0             1             0
## 4 chr17 26142853      +           0       0             0             1
## 5 chr17 26166954      +           0       0             1             0
## 6 chr17 26222417      +           1       1             0             0
##   Zf.unfertilized.egg
## 1                   0
## 2                   1
## 3                   0
## 4                   0
## 5                   0
## 6                   0

Note that the samples are ordered in the way they were supplied when creating the CAGEexp object and will be presented in that order in all the results and plots. To check sample labels and their ordering type:

sampleLabels(ce)
##               #FF0000               #CCFF00               #00FF66
##         "Zf.30p.dome"             "Zf.high"       "Zf.prim6.rep1"
##               #0066FF               #CC00FF
##       "Zf.prim6.rep2" "Zf.unfertilized.egg"

In addition, a colour is assigned to each sample, which is consistently used to depict that sample in all the plots. By default a rainbow palette of colours is used and the hexadecimal format of the assigned colours can be seen as names attribute of sample labels shown above. The colours can be changed to taste at any point in the workflow using the setColors() function.

## 3.4 Quality controls and preliminary analyses

### 3.4.1 Genome annotations

By design, CAGE tags map transcription start sites and therefore detect promoters. Quantitatively, the proportion of tags that map to promoter regions will depend both on the quality of the libraries and the quality of the genome annotation, which may be incomplete. Nevertheless, strong variations between libraries prepared in the same experiment may be used for quality controls.

CAGEr can intersect CTSSes with reference transcript models and annotate them with the name(s) of the models, and the region categories promoter, exon, intron and unknown, by using the annotateCTSS function. The reference models can be GENCODE loaded with the import.gff function of the rtracklayer package, or any other input that has the same structure, see help("annotateCTSS") for details. In this example, we will use a sample annotation for zebrafish (see help("exampleZv9_annot")).

annotateCTSS(ce, exampleZv9_annot)

The annotation results are stored as tag counts in the sample metadata, and as new columns in the CTSS genomic ranges

colData(ce)[,c("librarySizes", "promoter", "exon", "intron", "unknown")]
## DataFrame with 5 rows and 5 columns
##                     librarySizes  promoter      exon    intron   unknown
##                        <integer> <integer> <integer> <integer> <integer>
## Zf.30p.dome                41814     37843      2352       594      1025
## Zf.high                    45910     41671      2848       419       972
## Zf.prim6.rep1              34053     29531      2714       937       871
## Zf.prim6.rep2              34947     30799      2320       834       994
## Zf.unfertilized.egg        56140     51114      2860       400      1766
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 2 metadata columns:
##           seqnames       pos strand |  genes annotation
##              <Rle> <integer>  <Rle> |  <Rle>      <Rle>
##       [1]    chr17  26027430      + |           unknown
##       [2]    chr17  26050540      + | grid1a   promoter
##       [3]    chr17  26118088      + | grid1a       exon
##       [4]    chr17  26142853      + | grid1a     intron
##       [5]    chr17  26166954      + | grid1a       exon
##       ...      ...       ...    ... .    ...        ...
##   [23339]    chr17  45975041      - |           unknown
##   [23340]    chr17  45975540      - |           unknown
##   [23341]    chr17  45975544      - |           unknown
##   [23342]    chr17  45982697      - |           unknown
##   [23343]    chr17  45999921      - |           unknown
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7

A function plotAnnot is provided to plot the annotations as stacked bar plots. Here, all the CAGE libraries look very promoter-specific.

plotAnnot(ce, "counts")
## Warning: Removed 20 rows containing missing values (geom_segment).
## Warning: Removed 20 rows containing missing values (geom_point).

### 3.4.2 Correlation between samples

As part of the basic sanity checks, we can explore the data by looking at the correlation between the samples. The plotCorrelation2() function will plot pairwise scatter plots of expression scores per TSS or consensus cluster and calculate correlation coefficients between all possible pairs of samples2 Alternatively, the plotCorrelation() function does the same and colors the scatterplots according to point density, but is much slower.. A threshold can be set, so that only regions with an expression score (raw or normalized) above the threshold (either in one or both samples) are considered when calculating correlation. Three different correlation measures are supported: Pearson’s, Spearman’s and Kendall’s correlation coefficients. Note that while the scatterplots are on a logarithmic scale with pseudocount added to the zero values, the correlation coefficients are calculated on untransformed (but thresholded) data.

corr.m <- plotCorrelation2( ce, samples = "all"
, tagCountThreshold = 1, applyThresholdBoth = FALSE
, method = "pearson")

## 3.5 Merging of replicates

Based on calculated correlation we might want to merge and/or rearrange some of the datasets. To rearrange the samples in the temporal order of the zebrafish development (unfertilized egg -> high -> 30 percent dome -> prim6) and to merge the two replicas for the prim6 developmental stage we use the mergeSamples() function:

mergeSamples(ce, mergeIndex = c(3,2,4,4,1),
mergedSampleLabels = c("zf_unfertilized_egg", "zf_high", "zf_30p_dome", "zf_prim6"))
annotateCTSS(ce, exampleZv9_annot)

The mergeIndex argument controls which samples will be merged and how the final dataset will be ordered. Samples labeled by the same number (in our case samples three and four) will be merged together by summing number of CAGE tags per TSS. The final set of samples will be ordered in the ascending order of values provided in mergeIndex and will be labeled by the labels provided in the mergedSampleLabels argument. Note that mergeSamples function resets all slots with results of downstream analyses, so in case there were any results in the CAGEexp object prior to merging, they will be removed. Thus, annotation has to be redone.

## 3.6 Normalization

Library sizes (number of total sequenced tags) of individual experiments differ, thus normalization is required to make them comparable. The librarySizes function returns the total number of CAGE tags in each sample:

librarySizes(ce)
## [1] 56140 45910 41814 69000

The CAGEr package supports both simple tags per million normalization and power-law based normalization. It has been shown that many CAGE datasets follow a power-law distribution (Balwierz et al. 2009). Plotting the number of CAGE tags (X-axis) against the number of TSSs that are supported by <= of that number of tags (Y-axis) results in a distribution that can be approximated by a power-law. On a log-log scale this reverse cumulative distribution will manifest as a monotonically decreasing linear function, which can be defined as

$y = -1 * \alpha * x + \beta$

and is fully determined by the slope $$\alpha$$ and total number of tags T (which together with $$\alpha$$ determines the value of $$\beta$$).

To check whether our CAGE datasets follow power-law distribution and in which range of values, we can use the plotReverseCumulatives function:

plotReverseCumulatives(ce, fitInRange = c(5, 1000), onePlot = TRUE)