1 Introduction

Recent development of high-throughput sequencing technology has generated large amounts of data at transcriptome level such as RNA-binding protein target sites, RNA modifications and other RNA-related genomic features. These transcriptome attributes can be converted to a set of locations or regions.

Currently, colocalization analysis of two genomic region sets has been widely used to infer potential interactions between corresponding biological attributes. A number of methods have been developed for this. For example, regioneR has been created to statistically estimate association between genomic region sets using a permutation test framework. All its functions are genome and mask-aware. However, there has been no such tools enable researchers to make data analysis and interpretation at transcriptome level. Here RgnTX is invented to fill this blank.

RgnTX extends an assumption-free permutation test framework transcriptome-wide. Different from existing approaches, RgnTX allows the integration of transcriptome annotations so as to model the complex alternative splicing patterns. Importantly, it also supports the testing of transcriptome elements without clear isoform association, which is often the real scenario due to technical limitations.

RgnTX makes a useful tool for transcriptome region set association analysis. In this vignette, with some simulated datasets, we show that permutation tests conducted in the genome and the transcriptome are significantly different and may return distinct results in section 2. The most core functions for conducting permutation tests and examples with real datasets are introduced in section 6.

2 What is the difference between permutation tests in the genome and the transcriptome?

2.1 What does a permutation test do?

Suppose we want to know if a region set A has some association with another region set B. We could pick some random region sets, and calculate the number of overlaps between them and B. If the overlap numbers of these random region sets and B are generally smaller than that of A and B, it is more likely to say there exists some association between A and B, otherwise not.

Such permutation test framework mainly involves two steps: the randomization step (picking random region sets) and the evaluation step (evaluating overlapping counts or other test statistic). Importantly, performing it over the transcriptome and the genome can be different in both of these two steps.

2.2 Randomization step

Compared with the linear genome space, transcriptome space is heterogeneous due to the existence of multiple isoform transcripts, the splicing junctions and exons used by different frequencies. The isoform specificity of transcriptome elements is often lost in the process of mapping due to technical limitations. To perform isoform-aware permutation for transcriptome element with isoform ambiguity, RgnTX considers only the transcripts that overlap with it when projected to the genome, so as to retrain maximal isoform information. In this example shown by the following figures, feature 3 will be permutated on transcript Tx3 only, while feature 2 will be permutated on all the 3 transcripts.

The genome space is linear. The figure shows four genomic features in the genome.

Figure 1: The genome space is linear
The figure shows four genomic features in the genome.

The transcriptome space is heterogeneous. It is usually unclear which specific isoform transcript is associated with the transcriptome element because it overlaps with multiple isoforms when mapped to the genome, which is often the real scenario in biological research.

Figure 2: The transcriptome space is heterogeneous
It is usually unclear which specific isoform transcript is associated with the transcriptome element because it overlaps with multiple isoforms when mapped to the genome, which is often the real scenario in biological research.

Permutation space for each feature is distinct. Each feature will be permutated only within the set of isoform transcripts it is associated with.

Figure 3: Permutation space for each feature is distinct
Each feature will be permutated only within the set of isoform transcripts it is associated with.

This is supported by the RgnTX function randomizeFeaturesTx and the function randomizeFeaturesTxIA. The former is for features with known isoform belongings. The latter is for features with isoform ambiguity (IA). Besides, RgnTX also provides function randomizeTx supporting picking random region sets within any specified transcriptome space. Users can choose a suitable strategy to do randomization conveniently within only a few lines of codes. More details about this are introduced in section 5.

2.3 Evaluation step

Another non-trivial task is to develop evaluation function for a test statistic, such as the number of overlaps, to be tested between two region sets. A key limitation of existing evaluation approaches is that, they were designed primarily for genome-based evaluation. Two transnscriptome elements with shared exons but are located on the independent transcripts are also considered to have overlaps by them. The information of transcripts is totally lost during comparison, which can induce substantial bias when applying to analyze transcriptome region sets.

RgnTX provides several strategies for transcriptome-level evaluation. Function overlapCountsTX counts the number of overlaps between two transcriptome elements in the transcriptome directly. Function overlapCountsTXIA calculates a weighted number of overlaps between a feature set (with isoform ambiguity) and a transcriptome region set (without isoform ambiguity). It is also possible to write a custom evaluation function and pass it to the main permutation test function. Details about this part are introduced in section 7.

2.4 Comparison of permutations over different kinds of spaces

We perform a series of simulations to show the difference of permutation tests conducted in the transcriptome and in the genome.

In this simulation, we randomly generated 1000 pairs of region sets in the following four spaces, with each region set containing 500 regions of 200nt long on the same 300 genes, and then count the number of overlaps between each pair.

  • DNA: This space contains randomly picked 300 genes from a hg19 TxDb object.

  • pre-mRNA: It contains all of the pre-mRNAs that pertain to the previously picked 300 genes.

  • mRNA: It contains all of the mRNAs that pertain to the previously picked genes.

  • Exonic DNA: This space involves only the part of genome that is related to the previously picked mRNAs. It is actually a linear space of these mRNAs.

Figure 4 shows the evaluation results. In the first four cases, overlaps are counted based on the genome (or after projection). In the last case, overlaps are counted in the transcriptome directly. The former four cases are evaluated by regioneR function numOverlaps, while the last is evaluated by RgnTX function overlapCountsTx.

Overlapping counts between random region sets in different permutation spaces. Box boundaries represent the 25th and 75th percentiles; center line represents the median.

Figure 4: Overlapping counts between random region sets in different permutation spaces
Box boundaries represent the 25th and 75th percentiles; center line represents the median.

As the results show, permutations over different genome and transcriptome space return distinct results. A small number of overlaps are observed in the transcriptome (the last case), which is not surprising due to the existence of multiple isoform transcripts. Interestingly, the results from exonic DNA and mRNA (projected to genome) are also different, which is due to exons used at different frequencies by isoform transcripts. These results suggest that the permutation of heterogeneous transcriptome elements can be substantially more complex than genome-based elements. Genome-based methods are not appropriate to analyze transcriptome elements. That’s why we develop RgnTX to do this.

3 Installation

To install this package via devtools, please use the following codes.

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("yue-wang-biomath/RgnTX", build_vignettes = TRUE)

To install this package via BiocManager, please use the following codes.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("RgnTX")

4 Feature/region format in RgnTX

RgnTX is based on the GenomicRanges and GenomicFeatures structure. In this section, we introduce the format of region sets and other associated metadata used in RgnTX.

  • TxDb object used in RgnTX

A TxDb object stores transcripts, exons, CDS and other types of genomic features. A TxDb object is usually required by RgnTX functions. Currently, all the examples in this vignette are based on the TxDb.Hsapiens.UCSC.hg19.knownGene. Users can assign other TxDb objects via related function arguments.

  • Transcript ids and names

A TxDb object provides default ids and names for transcripts. According to GenomicFeatures, a transcript is guaranteed to be related to a unique transcript id. But a transcript’s name is not guaranteed to be unique or even defined. In this vignette, all the examples use transcript ids instead of names to indicate specific transcripts.

Please pay attention not to confuse the transcript ids with other types of ids, such as ids for CDS, exons and other genomic features provided by a TxDb object, or external ids for genes like Entrez Gene and Ensembl ids. Functions in RgnTX will not work with wrong types of ids.

  • Input feature set should be either GRanges or GRangesList

The feature/region sets in RgnTX should be either GRanges or GRangesList.

For a GRangesList feature set, every list element that contains a set of ranges is a feature. For a GRanges feature set without special instructions, a single range is a feature. For a GRanges feature set containing a metadata column named “group”, the set of ranges having the same group number is a feature.

The transcript ids (if available) should be provided as follows. For GRangesList objects, the name of each list element (feature) should be the id of the transcript that this feature is located on. For GRanges objects, transcript ids should be contained in a metadata column named “transcriptsHits”.

Here is an example of GRangesList feature set following the format required by RgnTX. Its names should be transcript ids.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

set.seed(12345677)
example.list <- randomizeTx(txdb, random_num = 10, random_length = 200)
example.list
## GRangesList object of length 10:
## $`9974`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames              ranges strand
##          <Rle>           <IRanges>  <Rle>
##   [1]     chr2 202725672-202725871      +
##   -------
##   seqinfo: 8 sequences from an unspecified genome; no seqlengths
## 
## $`52204`
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames              ranges strand
##          <Rle>           <IRanges>  <Rle>
##   [1]    chr14 104143758-104143860      +
##   [2]    chr14 104145721-104145817      +
##   -------
##   seqinfo: 8 sequences from an unspecified genome; no seqlengths
## 
## $`73524`
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr22 18893876-18893997      +
##   [2]    chr22 18894078-18894155      +
##   -------
##   seqinfo: 8 sequences from an unspecified genome; no seqlengths
## 
## ...
## <7 more elements>

The following GRanges object represents the same region set as the above one. Ranges having the same group number represent a feature. The transcriptsHits column contains corresponding transcript ids. Both formats are accepted by RgnTX functions and can be transformed to each other via function GRangesList2GRanges and function GRanges2GRangesList.

example.gr <- GRangesList2GRanges(example.list)
example.gr
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames              ranges strand | transcriptsHits     group
##           <Rle>           <IRanges>  <Rle> |     <character> <integer>
##    [1]     chr2 202725672-202725871      + |            9974         1
##    [2]    chr14 104143758-104143860      + |           52204         2
##    [3]    chr14 104145721-104145817      + |           52204         2
##    [4]    chr22   18893876-18893997      + |           73524         3
##    [5]    chr22   18894078-18894155      + |           73524         3
##    ...      ...                 ...    ... .             ...       ...
##   [13]    chr17   73917325-73917345      - |           64316         8
##   [14]    chr17   73916355-73916510      - |           64316         8
##   [15]    chr17   73916166-73916188      - |           64316         8
##   [16]    chr12 122989759-122989958      - |           49210         9
##   [17]     chr7 121960010-121960209      - |           31300        10
##   -------
##   seqinfo: 8 sequences from an unspecified genome; no seqlengths
example.list <- GRanges2GRangesList(example.gr)

5 Randomization function

In RgnTX, there are mainly two kinds of randomization functions. One can pick random regions over a specified space (see section 5.1). One can also randomize specified features into transcriptome, i.e., input a set of features and get a randomized result of it (see section 5.2).

5.1 Pick random regions over a specified space

In this strategy, users need to specify a scope that random region sets will be picked from.

5.1.1 randomizeTx function

In this function, users can pass transcript ids via an argument to make random regions to be picked from the transcripts having these ids.

Step 1: specify a randomization space

One needs to determine three arguments: a TxDb data, a set of transcript ids and the type of randomized regions the function will return. As an example, we specify a randomization space formed by five mRNAs.

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene 
type <- "mature"
trans.ids <- c("170", "782", "974", "1364", "1387")

We feed them into the randomizeTX function.

Step 2: randomizeTx function

We use this function to pick 10 random regions, and each region is 100 nt long. Each region is picked from a random transcript among the input transcripts.

set.seed(12345677)
randomResults <- randomizeTx(txdb, trans_ids = trans.ids, random_num = 10,
                             random_length = 100)
  • txdb: a TxDb obejct.

  • trans_ids (default: “all”): the ids of transcripts, which should be a character object. If this argument takes the default value all, random regions will be picked from the whole transcriptome.

  • random_num: the number of regions to be picked.

  • random_length: the length of each region to be picked.

  • type (default: “mature”): this argument receives options mature, full, fiveUTR, CDS or threeUTR, with which users can get corresponding types of random regions. Default is “mature”, with which the function picks regions over mRNAs.

  • N (default: 1): randomization times. The number of sets of region sets the function will return.

randomResults 
## GRangesList object of length 10:
## $`1364`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]     chr1 54433399-54433498      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`170`
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1 3599695-3599744      +
##   [2]     chr1 3624113-3624162      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`170`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1 3638649-3638748      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## ...
## <7 more elements>
width(randomResults)
## IntegerList of length 10
## [["1364"]] 100
## [["170"]] 50 50
## [["170"]] 100
## [["974"]] 6 94
## [["974"]] 100
## [["974"]] 100
## [["1387"]] 100
## [["1387"]] 100
## [["1364"]] 63 37
## [["170"]] 100

The function returns a GRangesList object. Every list element has a total length of 100 nt and represents a region on mRNA but possibly separated by intron parts. The name of each element is its relative transcript id.

5.1.2 randomizeTransByOrder function

This is another randomization function. The difference of this function and randomizeTx is that it does not take an argument about transcript ids, instead, it takes a GRangesList region set and picks only one random region within the space indicated by every GRangesList element. So the number of regions in a set this function returns is equal to the number of elements in the input GRangesList object.

Step 1: specify a randomization space

We pick a set of five mRNAs to be the scope of randomization.

trans.ids <- c("170", "782", "974", "1364", "1387")
exons.tx0 <- exonsBy(txdb)
regions.A <- exons.tx0[trans.ids]
regions.A
## GRangesList object of length 5:
## $`170`
## GRanges object with 13 ranges and 3 metadata columns:
##        seqnames          ranges strand |   exon_id   exon_name exon_rank
##           <Rle>       <IRanges>  <Rle> | <integer> <character> <integer>
##    [1]     chr1 3569129-3569205      + |       474        <NA>         1
##    [2]     chr1 3598897-3598994      + |       475        <NA>         2
##    [3]     chr1 3599624-3599744      + |       476        <NA>         3
##    [4]     chr1 3624113-3624355      + |       479        <NA>         4
##    [5]     chr1 3638585-3638771      + |       481        <NA>         5
##    ...      ...             ...    ... .       ...         ...       ...
##    [9]     chr1 3644693-3644781      + |       486        <NA>         9
##   [10]     chr1 3645891-3646012      + |       487        <NA>        10
##   [11]     chr1 3646564-3646712      + |       489        <NA>        11
##   [12]     chr1 3647491-3647629      + |       490        <NA>        12
##   [13]     chr1 3649311-3652765      + |       492        <NA>        13
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## ...
## <4 more elements>

Step 2: randomizeTransByOrder function

randomizeTransByOrder requires two inputs: a region set with GRangesList format as the randomization scope, and the length of each random region is going to be picked from every GRangesList element.

In the following example, we input a region set containing five regions. The function picks five 100 nt regions from each of them in order.

set.seed(12345677)
random.regions.A <- randomizeTransByOrder(regions_A = regions.A, 
random_length = 100)
width(regions.A)
## IntegerList of length 5
## [["170"]] 77 98 121 243 187 116 110 143 89 122 149 139 3455
## [["782"]] 76 82 51 188 180 97 123 156 120 153 1365
## [["974"]] 631 110 77 2040
## [["1364"]] 224 34 487 132 119 89 114 85 504
## [["1387"]] 191 138 407
width(random.regions.A)
## IntegerList of length 5
## [["170"]] 100
## [["782"]] 50 50
## [["974"]] 100
## [["1364"]] 69 31
## [["1387"]] 1 99

5.2 Randomize features into transcriptome

Functions introduced in this section support randomizing two kinds of RNA-related genomic features:

  • Features without isoform ambiguity: for features without isoform ambiguity, it is clear which specific isoform each feature is located on (See Figure 5). Each feature will be randomized within its relative transcript.

  • Features with isoform ambiguity: the isoform specificity of transcriptome elements is unclear to us because each of them overlaps with multiple isoforms when mapped to the genome (See Figure 6). Each feature will be randomized into the set of isoform transcripts it is associated with.

Features without isoform ambiguity.

Figure 5: Features without isoform ambiguity