metagene2
objectsPackage: metagene2
Modified: April 2nd, 2019
Compiled: Wed May 1 18:03:17 2024
License: Artistic-2.0
This package produces metagene plots, and is the successor to the metagene
package. Users of metagene
can find a list of differences between metagene2
and metagene
in the Differences with metagene section of this vignette.
Metagene plots aggregate coverages from multiple sources (bam files) over multiple regions (genes, cofactor binding sites, etc.) to provide profiles of average coverage. They are useful for many different purposes, such as comparing the binding profiles of DNA-interacting proteins at selected groups of features. In a typical analysis, these features will be the transcription start sites (TSS) of genes, transcription factor binding sites, or enhancer regions. Multiple combinations of groups of features and/or groups of bam files can be compared in a single analysis. The metagene2 package uses bootstrap analysis to provide an estimation of the mean enrichment and a confidence interval for each group of samples.
This vignette will introduce the main features of the metagene2 package. You can load
the metagene2 package by calling library(metagene2)
:
library(metagene2)
metagene2
objects are used to perform all of the analysis steps necessary
to produce metagene plots. Calling metagene2$new
creates a metagene2
object
and requires only two mandatory parameters: bam_files
, which is the list of
bam files from which coverages should be extracted, and regions
, which is the
list of regions over which said coverages are computed. We also recommend using
the optional assay
parameter, which can be one of 'chipseq'
or 'rnaseq'
,
and will automatically set other optional parameters to convenient defaults.
We discuss each of these arguments below.
# metagene objects are created by calling metagene2$new and providing
# regions and bam files:
mg <- metagene2$new(regions = get_demo_regions(),
bam_files = get_demo_bam_files(),
assay='chipseq')
# We can then plot coverage over those regions across all bam files.
mg$produce_metagene(title = "Demo metagene plot")
There is no hard limit on the number of BAM files that can be included in an
analysis. However, loading a large number of bam files might also require large amounts
of memory. The provided bam files must be indexed: a file named file.bam
,
must have an accompanying file.bam.bai
or file.bai
in its directory.
The paths (relative or absolute) to the BAM files must be provided in a vector.
If the vector is named, then those names will be used to refer to the bam files
in subsequent steps. Otherwise, metagene2
will attempt to generate appropriate
names.
# We create a vector with paths to the bam files of interest.
bam_files <- get_demo_bam_files()
basename(bam_files)
## [1] "align1_rep1.bam" "align1_rep2.bam" "align2_rep1.bam" "align2_rep2.bam"
## [5] "ctrl.bam"
Each bam file must have a corresponding index file:
# List .bai matching our specified bam files.
basename(Sys.glob(gsub(".bam", ".ba*", bam_files)))
## [1] "align1_rep1.bam" "align1_rep1.bam.bai" "align1_rep2.bam"
## [4] "align1_rep2.bam.bai" "align2_rep1.bam" "align2_rep1.bam.bai"
## [7] "align2_rep2.bam" "align2_rep2.bam.bai" "ctrl.bam"
## [10] "ctrl.bam.bai"
If no names were provided for the bam files, metagene automatically generates some:
mg <- metagene2$new(regions = get_demo_regions(), bam_files = bam_files)
names(mg$get_params()[["bam_files"]])
## [1] "align1_rep1" "align1_rep2" "align2_rep1" "align2_rep2" "ctrl"
We also could have explicitly named our bam files.
names(bam_files) = c("a1_1", "a1_2", "a2_1", "a2_2", "ctrl")
mg <- metagene2$new(regions = get_demo_regions(), bam_files = bam_files)
names(mg$get_params()[["bam_files"]])
## [1] "a1_1" "a1_2" "a2_1" "a2_2" "ctrl"
The regions for the metagene analysis can be provided in one of three different formats:
character
vector, containing the paths to bed, narrowPeak, broadPeak or gtf
files describing the regions to be used.GRanges
or GRangesList
object defining a set of contiguous regions.GRangesList
where each element defines a set of regions to be stitched
together to be considered as a single logical region.metagene2
can automatically import your regions of interest if they are already
defined in a file with one of the following formats:
A file’s extension will usually reflect the format it is stored in.
regions <- get_demo_region_filenames()
regions
## [1] "/tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/list1.bed"
## [2] "/tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/list2.bed"
By providing those two file names to metagene2$new
, they will be loaded
and converted into appropriate objects:
mg <- metagene2$new(regions = get_demo_region_filenames(),
bam_files = get_demo_bam_files())
mg$get_regions()
## GRanges object with 100 ranges and 3 metadata columns:
## seqnames ranges strand | name score region_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr1 16103663-16105662 + | list1_1 0 list1
## [2] chr1 23921318-23923317 + | list1_2 0 list1
## [3] chr1 34848977-34850976 + | list1_3 0 list1
## [4] chr1 36368182-36370181 + | list1_4 0 list1
## [5] chr1 36690488-36692487 + | list1_5 0 list1
## ... ... ... ... . ... ... ...
## [96] chr1 81075951-81077950 - | list2_46 0 list2
## [97] chr1 85108854-85110853 - | list2_47 0 list2
## [98] chr1 85960056-85962055 - | list2_48 0 list2
## [99] chr1 86110971-86112970 - | list2_49 0 list2
## [100] chr1 87155522-87157521 - | list2_50 0 list2
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
As an alternative to a list of BED files, GRanges
objects can be used
to define contiguous regions of interest. Each range defined within the GRanges
object is treated separately from the others. GRangesList
objects
are also accepted, but they are automatically coerced into GRanges
objects, and a column named region_name
bearing the name of the list elements
is added to the coerced GRanges
. Here is an example of valid regions
provided as a GRangesList
:
regions <- get_demo_regions()
regions
## GRangesList object of length 2:
## $list1
## GRanges object with 50 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 16103663-16105662 + | list1_1 0
## [2] chr1 23921318-23923317 + | list1_2 0
## [3] chr1 34848977-34850976 + | list1_3 0
## [4] chr1 36368182-36370181 + | list1_4 0
## [5] chr1 36690488-36692487 + | list1_5 0
## ... ... ... ... . ... ...
## [46] chr1 172081530-172083529 + | list1_46 0
## [47] chr1 172081796-172083795 + | list1_47 0
## [48] chr1 172147016-172149015 + | list1_48 0
## [49] chr1 172205805-172207804 + | list1_49 0
## [50] chr1 172260642-172262641 + | list1_50 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $list2
## GRanges object with 50 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 3670499-3672498 - | list2_1 0
## [2] chr1 5916399-5918398 - | list2_2 0
## [3] chr1 9699210-9701209 - | list2_3 0
## [4] chr1 9907639-9909638 - | list2_4 0
## [5] chr1 10718946-10720945 - | list2_5 0
## ... ... ... ... . ... ...
## [46] chr1 81075951-81077950 - | list2_46 0
## [47] chr1 85108854-85110853 - | list2_47 0
## [48] chr1 85960056-85962055 - | list2_48 0
## [49] chr1 86110971-86112970 - | list2_49 0
## [50] chr1 87155522-87157521 - | list2_50 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
When loaded by metagene2
, they are converted to a GRanges
:
mg <- metagene2$new(regions = regions,
bam_files = get_demo_bam_files())
mg$get_regions()
## GRanges object with 100 ranges and 3 metadata columns:
## seqnames ranges strand | name score region_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr1 16103663-16105662 + | list1_1 0 list1
## [2] chr1 23921318-23923317 + | list1_2 0 list1
## [3] chr1 34848977-34850976 + | list1_3 0 list1
## [4] chr1 36368182-36370181 + | list1_4 0 list1
## [5] chr1 36690488-36692487 + | list1_5 0 list1
## ... ... ... ... . ... ... ...
## [96] chr1 81075951-81077950 - | list2_46 0 list2
## [97] chr1 85108854-85110853 - | list2_47 0 list2
## [98] chr1 85960056-85962055 - | list2_48 0 list2
## [99] chr1 86110971-86112970 - | list2_49 0 list2
## [100] chr1 87155522-87157521 - | list2_50 0 list2
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
For more details about each datasets, please refer to their documentation
(i.e.:?promoters_hg19
).
For certain types of analyses, it is useful to stitch together several regions into one logical unit. This is the case in RNA-seq data, where exons are individual regions which make more sense when grouped together into a single transcript.
For these cases, regions
can be a GRangesList
object where each element
is one such logical region. One must also specify the region_mode="stitch"
parameter when creating the new metagene object. When assay='rnaseq'
,
region_mode
is automatically set to "stitch"
.
regions <- get_demo_rna_regions()
regions
## GRangesList object of length 2:
## $DPM1
## GRanges object with 10 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr20 50934868-50935236 - | <NA> NA
## [2] chr20 50936149-50936262 - | <NA> NA
## [3] chr20 50940866-50940955 - | <NA> NA
## [4] chr20 50941106-50941209 - | <NA> NA
## [5] chr20 50942032-50942126 - | <NA> NA
## [6] chr20 50945738-50945762 - | <NA> NA
## [7] chr20 50945848-50945923 - | <NA> NA
## [8] chr20 50948630-50948662 - | <NA> NA
## [9] chr20 50955187-50955285 - | <NA> NA
## [10] chr20 50958364-50958555 - | <NA> NA
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## $NDUFAB1
## GRanges object with 7 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr16 23581003-23581173 - | <NA> NA
## [2] chr16 23581872-23582375 - | <NA> NA
## [3] chr16 23585337-23585423 - | <NA> NA
## [4] chr16 23587198-23587319 - | <NA> NA
## [5] chr16 23590887-23591153 - | <NA> NA
## [6] chr16 23595430-23595638 - | <NA> NA
## [7] chr16 23596124-23596356 - | <NA> NA
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
In stitch mode, the loaded regions remain in a GRangesList
, rather than being coerced
into a GRanges
.
mg <- metagene2$new(regions = regions,
bam_files = get_demo_rna_bam_files(),
region_mode="stitch")
mg$get_regions()
## GRangesList object of length 2:
## $DPM1
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | name score region_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr20 50934868-50935236 - | <NA> NA DPM1
## [2] chr20 50936149-50936262 - | <NA> NA DPM1
## [3] chr20 50940866-50940955 - | <NA> NA DPM1
## [4] chr20 50941106-50941209 - | <NA> NA DPM1
## [5] chr20 50942032-50942126 - | <NA> NA DPM1
## [6] chr20 50945738-50945762 - | <NA> NA DPM1
## [7] chr20 50945848-50945923 - | <NA> NA DPM1
## [8] chr20 50948630-50948662 - | <NA> NA DPM1
## [9] chr20 50955187-50955285 - | <NA> NA DPM1
## [10] chr20 50958364-50958555 - | <NA> NA DPM1
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## $NDUFAB1
## GRanges object with 7 ranges and 3 metadata columns:
## seqnames ranges strand | name score region_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr16 23581003-23581173 - | <NA> NA NDUFAB1
## [2] chr16 23581872-23582375 - | <NA> NA NDUFAB1
## [3] chr16 23585337-23585423 - | <NA> NA NDUFAB1
## [4] chr16 23587198-23587319 - | <NA> NA NDUFAB1
## [5] chr16 23590887-23591153 - | <NA> NA NDUFAB1
## [6] chr16 23595430-23595638 - | <NA> NA NDUFAB1
## [7] chr16 23596124-23596356 - | <NA> NA NDUFAB1
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Some common ranges that can be useful for plotting include the set of all TSSes or gene bodies. While metagene2 does not provide those, they can easily be generated using packages from BioConductor:
# First locate the TxDb package containing the geneset of interest.
# Some of the most common TxDb packages include:
# - TxDb.Hsapiens.UCSC.hg38.knownGene
# - TxDb.Hsapiens.UCSC.hg19.knownGene
# - TxDb.Mmusculus.UCSC.mm10.knownGene
# - TxDb.Mmusculus.UCSC.mm10.ensGene
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# We'll use the GenomicFeatures package to obtain gene/TSS coordinates
# from the TxDb package.
library(GenomicFeatures)
# The GenomicFeatures::genes function provides us with gene bodies.
all_gene_bodies = GenomicFeatures::genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
# The GenomicFeatures::promoters function gets a region flanking the TSS.
# By using it directly on TxDb.Hsapiens.UCSC.hg38.knownGene, we would get
# the TSSes of all transcripts. Here, we use it on the gene_bodies GRanges
# we've just created, and limit ourselves to one TSS per gene.
all_TSS = GenomicFeatures::promoters(all_gene_bodies,
upstream=2000, downstream=2000)
By default, metagene2
aggregates all passed-in regions together, and treats all
bam files separately. However, most non-trivial analyses will benefit from more
granularity. Bam files can be split among different ChIP-seq experiments and/or
multiple replicates. Regions can likewise be split according to multiple criteria:
is the underlying gene up- or down-regulated? Is the enhancer bound by a cofactor
of interest? Below, we discuss how metagene2
allows the user to specify those
groupings to produce relevant analyses.
In metagene2
, an experimental design is a set of design groups, each of which
is defined as a set of “input” bam files and a set of “control” bam files.
There is no limit to the number of design groups, though a large number of
design groups will require a proportionately large amount of memory. A
BAM file can be assigned to more than one design group.
The experimental design is expressed using a data-frame, where each row represents
a bam file. The very first column of the data-frame must identify the bam files,
using either their paths or their names as specified in the bam_files
argument.
Each subsequent column then represents an individual design group. The column name
defines the design group’s name, and the column values determine how each bam file
relates to the design group:
* 0: ignore file
* 1: input
* 2: control
A design group does not need to have a control, but it must have at least one input. Control samples are ignored when no normalization or “RPM” normalization is chosen. However, they are used to remove background noise using “NCIS” normalization is selected, or to compute coverage ratios with a control sample when “log2_ratio” normalization is applied.
example_design <- data.frame(Samples = bam_files,
align1 = c(1,1,0,0,2),
align2 = c(0,0,1,1,2))
kable(example_design)
Samples | align1 | align2 | |
---|---|---|---|
a1_1 | /tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/align1_rep1.bam | 1 | 0 |
a1_2 | /tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/align1_rep2.bam | 1 | 0 |
a2_1 | /tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/align2_rep1.bam | 0 | 1 |
a2_2 | /tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/align2_rep2.bam | 0 | 1 |
ctrl | /tmp/RtmpkpvGSz/Rinst98b4749c215f3/metagene2/extdata/ctrl.bam | 2 | 2 |
# Initializing the metagene object.
mg <- metagene2$new(regions = get_demo_regions(),
bam_files = get_demo_bam_files(),
assay='chipseq')
# Plotting while grouping the bam files by design group
mg$produce_metagene(design=example_design)