metagene: A package to produce Metafeature plots

Charles Joly Beauparlant, Fabien Claude Lamaze, Rawane Samb, Astrid Louise Deschenes and Arnaud Droit.

This package and the underlying metagene code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

Introduction

This package produces Metagene-like plots to compare the behavior of DNA-interacting proteins at selected groups of features. A typical analysis can be done in viscinity of transcription start sites (TSS) of genes or at any regions of interest (such as enhancers). Multiple combinations of group of features and/or group of bam files can be compared in a single analysis. Bootstraping analysis is used to compare the groups and locate regions with statistically different enrichment profiles. In order to increase the sensitivity of the analysis, alignment data is used instead of peaks produced with peak callers (i.e.: MACS2 or PICS). The metagene package uses bootstrap to obtain a better estimation of the mean enrichment and the confidence interval for every group of samples.

Currently supported species are human and mouse.

This vignette will introduce all the main features of the metagene package.

Loading metagene package

suppressMessages(library(metagene))

Inputs

Alignment files (BAM files)

There is no hard limit in the number of BAM files that can be included in an analysis (but with too many BAM files, memory may become an issue). To speed up the analysis, it is best to (but not mandatory) to index BAM files before starting the analysis. If no index is found for a file, the metagene package will use Rsamtools to sort and index it.

The path (relative or absolute) to the BAM files must be in a vector:

bamFile1Rep1 <- system.file("extdata/align1_rep1.bam", package="metagene")
bamFile1Rep2 <- system.file("extdata/align1_rep2.bam", package="metagene")
bamFile2Rep1 <- system.file("extdata/align2_rep1.bam", package="metagene")
bamFile2Rep2 <- system.file("extdata/align2_rep2.bam", package="metagene")
bamFileCTRL <- system.file("extdata/ctrl.bam", package="metagene")
bamFiles <- c(bamFile1Rep1, bamFile1Rep2, bamFile2Rep1, bamFile2Rep2, bamFileCTRL)
bamFiles

## [1] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep1.bam"
## [2] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep2.bam"
## [3] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep1.bam"
## [4] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep2.bam"
## [5] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/ctrl.bam"

For this demo, we have 2 samples (each with 2 replicates) and 1 control.

Design groups

A design group contains a set of BAM files that, when pull togheter, represent a logical analysis. Furthermore, a design group contains the relationship between every BAM files present. Samples (with or without replicates) and controls can be assigned to a same design group. There can be as many groups as necessary. A BAM file can be assigned to more than one group.

To represent the relationship between every BAM files, design groups must ha the following columns:

The list of paths to every BAM files related to an analysis.
One column per group of files (replicates and/or controls).

There is two possible way to create design groups, by reading a file or by directly creating a design object in R.

Design groups from a file

Design groups can be loaded into the metagene package by using a text file. As the relationship between BAM files as to be specified, the following columns are mandatory :

First column: The list of paths (absolute or relative) to every BAM files for all the design groups.
Following columns: One column per design group (replicates and/or controls). The column can take only three valid values:
- 0: ignore file
- 1: input
- 2: control

The file must also contain a header. It is recommander to use Samples for the name of the first column, but the value is not checked. The other columns in the design file will be used for naming design groups, and must be unique.

fileDesign <- system.file("extdata/design.txt", package="metagene")
design <- read.table(fileDesign, header=TRUE, stringsAsFactors=FALSE)
design$Samples <- paste(system.file("extdata", package="metagene"), design$Samples, sep="/")
design

##                                                              Samples
## 1 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep1.bam
## 2 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep2.bam
## 3 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep1.bam
## 4 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep2.bam
## 5        /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/ctrl.bam
##   align1 align2
## 1      1      0
## 2      1      0
## 3      0      1
## 4      0      1
## 5      2      2

Design groups from R

It is not obligatory to use a design file, you can create the design data.frame using your prefered method (as long as the restrictions on the values mentioned previously are respected).

For instance, the previous design data.frame could have been create directly in R:

design <- data.frame(Samples = c("align1_rep1.bam", "align1_rep2.bam", "align2_rep1.bam", "align2_rep2.bam", "ctrl.bam"),
             align1 = c(1,1,0,0,2), align2 = c(0,0,1,1,2))
design$Samples <- paste0(system.file("extdata", package="metagene"), "/", design$Samples)
design

##                                                              Samples
## 1 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep1.bam
## 2 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep2.bam
## 3 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep1.bam
## 4 /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep2.bam
## 5        /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/ctrl.bam
##   align1 align2
## 1      1      0
## 2      1      0
## 3      0      1
## 4      0      1
## 5      2      2

Features groups

Features groups represents all genes, TSS or genome regions that form a logical collection for an analysis. Genome regions, which are called later regions groups, are a subclass of features groups and require special treatment in the analysis steps. There is different ways to provide features groups to the metagene package.

Every known TSS (default) - Features

When no features groups are provided to the metagene package, it use every known TSS for the selected specie (specified using the specie argument, default value: “human”).

By default, 5000 base pairs aroud each TSS will be used, but this can be changed with the maxDistance argument.

Ensembl Gene ID(s) files - Features

It is possible to provide, to metagene package, a text file that contains a list of Ensembl Gene ID(s) as input. The metagene package will use the Gene ID to extact the TSS information. The analysis will be done around the TSS of all genes. The files must be formated in the following way:

1 Ensembl Gene ID(s) per line.

The file name (minus the extension) will be used to name the corresponding group (called a features group).

fileExample <- system.file("extdata/list1.txt", package="metagene")
featuresFileExample <- read.table(fileExample, header=FALSE, stringsAsFactors=FALSE)
head(featuresFileExample)

##                   V1
## 1 ENSMUSG00000043716
## 2 ENSMUSG00000026155
## 3 ENSMUSG00000026123
## 4 ENSMUSG00000010453
## 5 ENSMUSG00000061518
## 6 ENSMUSG00000037351

The metagene package require a vector of the paths (absolute or relative) for every Ensembl Gene ID(s) files to be included in the analysis:

fileList1 <- system.file("extdata/list1.txt", package="metagene")
fileList2 <- system.file("extdata/list2.txt", package="metagene")
features <- c(fileList1, fileList2)
features

## [1] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/list1.txt"
## [2] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/list2.txt"

Note: When there are multiple design groups and features groups, the metagene package will produce a group for every possible combinations of the two.

BED files - Regions

To compare custom regions, it is possible to use a list of one or more BED files.

fileBed1 <- system.file("extdata/list1.bed", package="metagene")
fileBed2 <- system.file("extdata/list2.bed", package="metagene")
regions <- c(fileBed1, fileBed2)
regions

## [1] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/list1.bed"
## [2] "/tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/list2.bed"

The name of the files (without the extension) will be used as features groups names inside the metagene package.

GRanges or GRangesList objects - Regions

As an alternative to a list of bed files, GRanges or GRangesList objects can be used.

Analysis steps

Parsing input files

The main input of the metagene package are alignment files in bam format. It is also possible to include a design file to represent the relations between samples (replicates and controls). The regions to analyse can be either in BED format (for user-specified regions) or in Ensembl Gene ID(s) format (in which case, the analysis will be done around the TSS of the genes).

Parsing features

To parse the sequences overlapping a list of features, you must use the parseFeatures function:

groupsFeatures <- parseFeatures(bamFiles=bamFiles, features=features, design=design, specie="mouse", maxDistance=1000)

## Step 1: Prepare bam files... Done!
## Step 2: Prepare regions... Done!
## Step 3: Parse bam files...
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep1.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep2.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep1.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep2.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/ctrl.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep1.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align1_rep2.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep1.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/align2_rep2.bam"
## [1] "Current bam: /tmp/RtmpytaXjD/Rinst4cca13bc9018/metagene/extdata/ctrl.bam"
## Step 3: Parse bam files... Done!
## Step 4: Merge matrix... Done!

If no feature files are provided, the package will use every known TSS for the selected specie (use the specie argument, default value: “human”)

By default, 5000 base pairs aroud each TSS will be used, but this can be changed with the maxDistance argument.

Parsing regions

Parsing regions is very similar to parsing features, except that we must use bed files instead of features files:

groupsRegions <- parseRegions(bamFiles=bamFiles, regions=regions, design=design, specie="mouse", paddingSize=0)

Since it's possible that the regions will have different length, metagene will scale them so they all have the length of the median region. This is necessary for the bootstrapping analysis that uses matrices.

There is no maxDistance argument because the size of the regions are already defined in the bed files. It is possible to add paddings on each side of every regions with the paddingSize argument. Please note that the padding is added after the main regions are scaled.

Plotting results

During the plotting step, it is possible to specify which groups of features or regions to add into the plot. The results of the parseFeatures or parseRegions can be used directly by the plotMatrices function.

Before producing the plot, the metagene package will do a bootstrap analysis to obtain better estimates of the mean and confidence intervals for each groups. This step can be computationally expensive and the running time will directly depends on the binSize and sampleSize arguments.

When there are multiple design groups and features groups, the metagene package produces a group for every possible combinations of the two. The name of each combination group is a concatenation of the feature group name and the design group name separated by an underscore (_). Those combination groups can be freely intermixed to genere different type of plotting results.

You must use a list object to represent the groups to use and how to name them. For example, if we want to combine all the samples (defined as design names align1 and align2) for each list (defined as features groups list1 and list2), we could do the following:

names(groupsFeatures$matrix)

## [1] "list1_align1" "list1_align2" "list2_align1" "list2_align2"

groupsToPlot <- list(group1=c("list1_align1", "list1_align2"), group2=c("list2_align1", "list2_align2"))
DF <- plotMatrices(groupsToPlot, groupsFeatures)

plot of chunk plotMatricesExample1

If we wanted to compare the behavior of our 2 samples (defined as design names align1 and align2) for the first list of genes (defined as features group list1), we could do the following:

groupsToPlot <- list(group1="list1_align1", group2="list1_align2")
DF <- plotMatrices(groupsToPlot, groupsFeatures)

plot of chunk plotMatricesExample2

The plotMatrices function returns the data.frame that was used by ggplot2 to produce the results.

head(DF)

##   Groups distances     means     qinf      qsup
## 1 group1      -950  48.83094 32.89430  67.50748
## 2 group1      -850  77.57053 51.94780 109.04238
## 3 group1      -750  91.27509 57.15875 139.43487
## 4 group1      -650  90.37535 67.09998 115.56556
## 5 group1      -550 110.65427 82.38905 139.52067
## 6 group1      -450 126.08510 86.85923 179.69791