# 1. Introduction

In alternative splicing analysis of RNA-seq data, one popular approach is to first identify gene features (e.g. exons or junctions) significantly associated with splicing using methods such as DEXSeq (Anders, Reyes, and Huber 2012) or JunctionSeq (Hartley and Mullikin 2016), and then perform pathway analysis based on the list of genes associated with the significant gene features. For results from the DEXSeq package, gene features represent non-overlapping exon counting bins (Anders, Reyes, and Huber (2012); Figure 1), while for results from the JunctionSeq package, they represent non-overlapping exon or splicing-junction counting bins.

Pathway analysis without explicit adjustment can be biased toward pathways that include genes with a large number of gene features; this is because these genes are more likely to be selected as “significant genes” in pathway analysis. This is a major challenge in RNA-seq data analysis, and this package offers a solution. The main features of the PathwaySplice package include:

1. Perform pathway analysis that explicitly adjusts for the number of exons or junctions associated with each gene.
2. Visualize selection bias due to different number of exons or junctions across different genes.
3. Formally test for selection bias using logistic regression.
4. Support gene sets based on Gene Ontology categories, user defined gene sets, and other broadly defined gene sets (e.g. MSigDB).
5. Identify specific genes that drive pathway significance.
6. Organize significant pathways within an enrichment map, where pathways with large numbers of overlapping genes are grouped together in a network graph.

# 2. Package quick start guide

After installing the PathwaySplice package, load the package namespace via:

# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("PathwaySplice")
library(PathwaySplice)

library(devtools)
install_github("SCCC-BBC/PathwaySplice")

The input file of the PathwaySplice functions are a vector of $$p$$-values associated with gene features and a mapping between gene features and genes. This information can be obtained from DEXSeq or JunctionSeq output files. As an example, PathwaySplice includes a gene-feature annotated dataset, based on an RNA-seq study of bone marrow CD34+ cells from eight myelodysplastic syndrome (MDS) patients with SF3B1 mutations and four patients without the mutation (Dolatshad et al. 2015). This dataset was downloaded from the GEO database with accession key GSE63569.

First, we use the JunctionSeq package to perform differential exon usage analysis. For demonstration, we selected gene features mapped to a random subset of 5,000 genes. The example dataset can be loaded directly:

data(featureBasedData)
head(featureBasedData)

Next, convert the example data to a gene-based table with the makeGeneTable function. The $$p$$-value method is the default.

gene.based.table <- makeGeneTable(featureBasedData, stat = "pvalue")
head(gene.based.table)

Here, the geneWisePvalue column is simply the minimum feature-based $$p$$-value among gene features for gene ENSG00000000938, numFeature represents the number of gene features for that gene, fdr measures the false discovery rate (FDR) adjustment for the $$p$$-value given in the genewisePvalue column, and sig.gene = 1 indicates that the gene is significant at the $$\alpha = 0.05$$ level.

To assess selection bias, i.e. whether genes with more features are more likely to be selected as significant, the lrTestBias function fits a logistic regression to test the association between sig.gene and numFeature.

lrTestBias(gene.based.table, boxplot.width = 0.3)
#> [1] "P-value from logistic regression is 3.98e-205"

Note that typically selection bias remains even when the stat = "fdr" option is used to select significant genes in makeGeneTable instead of stat = "pvalue" (the default option). This is because FDR has a monotonic relationship with $$p$$-values. For example, in our featureBasedData example data set, when selecting significant genes using the "fdr" option, substantial selection bias remained (as shown in the boxplot below).

gene.based.table.fdr <- makeGeneTable(featureBasedData, stat = "fdr")
lrTestBias(gene.based.table.fdr, boxplot.width = 0.3)
#> [1] "P-value from logistic regression is 2.18e-93"

To perform pathway analysis that adjusts for the number of gene features, we use the runPathwaySplice function, which implements the methodology described in Young et al. (2010). The runPathwaySplice function returns a tibble object. Tibbles are similar in structure to data frames but use a different print method: they show only the first few rows and as many columns that will fit on the screen. Therefore, your tibbles may print differently than what is shown here.

result.adjusted <- runPathwaySplice(genewise.table = gene.based.table.fdr,
genome = "hg19",
id = "ensGene",
test.cats = c("GO:BP"),
go.size.limit = c(5, 30),
method = "Wallenius",
use.genes.without.cat = TRUE)

head(result.adjusted)

The returned result.adjusted tibble has the statistical significance of each pathway shown in the over_represented_pvalue column, significant genes which drive pathway significance are given in the SIGgene_ensembl column (not shown), and the corresponding gene symbols is returned in the SIGgene_symbol column (also not shown). An additional bias plot is also generated; this visualizes the relationship between the proportion of significant genes and the mean number of gene features within gene bins. Note that the default option for use.genes.without.cat is FALSE, meaning that genes not mapped to any tested categories are ignored in the analysis. In order for all genes in genewise.table to be counted towards the total number of genes outside the category (i.e. genes in the background), we set this option to TRUE.

To visualize pathway analysis results in a pathway enrichment network, we use the enrichmentMap function. The pathway.res input of this function is the output of runPathwaySplice function shown above. The n = 7 argument means that the produced figure will show the 7 most significant pathways in an enrichment map. A file named network.layout.for.cytoscape.gml is generated in the output.file.dir directory specified by users. This file can be used as an input file for Cytoscape software (Shannon et al. 2003), which allows users to further manually adjust the appearance of the generated network.

enmap <- enrichmentMap(pathway.res = result.adjusted,
n = 7,
output.file.dir = tempdir(),
similarity.threshold = 0.5,
scaling.factor = 2)

In the enrichment map, the node size indicates the number of significant genes within the pathway. The node hue indicates pathway significance, where smaller $$p$$-values correspond to dark red color. Pathways with Jaccard coefficient greater than the value of similarity.thereshold will be connected on the network, indicating that these connected pathways contain common set of genes. The edge thickness corresponds to Jaccard similarity coefficient between the two pathways, scaled by the value supplied to scaling.factor. Note that the input of enrichmentMap function is an object returned from runPathwaySplice function, so enrichmentMap function supports both GO annotation as well as customized genesets.

# 3. Testing customized gene sets

### MSigDB hallmark pathways and Reactome pathways

To perform pathway analysis with other user-defined gene sets, one needs to specify the pathway database in .gmt format first, then use the gmtGene2Cat function before passing this output as the value of the gene2cat argument in the pathwaySplice function. The MSigDB hallmark gene sets can be downloaded from the Broad institute. Similarly, Reactome pathways in .gmt format can be downloaded from REACTOME: in the “Specialized data formats” page section click the “Reactome Pathways Gene Set” link to download the zipped file.

For demonstration, we have included a sample of hallmark gene sets in the package. The gmtGene2Cat function takes in a directory wherein a .gmt file is stored and imports that file for internal use. Analysis for hallmark gene sets can be accomplished using the following code:

dir.name <- system.file("extdata", package = "PathwaySplice")
hallmark.local.pathways <- file.path(dir.name, "h.all.v6.0.symbols.gmt.txt")
hlp <- gmtGene2Cat(hallmark.local.pathways, genomeID = "hg19")

result.hallmark <- runPathwaySplice(genewise.table = gene.based.table.fdr,
genome = "hg19",
id = "ensGene",
gene2cat = hlp,
go.size.limit = c(5, 100),
method = "Wallenius",
binsize = 20,
use.genes.without.cat = TRUE)

### KEGG pathways

The KEGG pathways in .gmt format can be obtained by calling the outKegg2gmt function, which creates a .gmt file based on the organism specified. This is a wrapper for the get.kegg.genesets function from the EnrichmentBrowser package and modifies the output into a .gmt file (Geistlinger, Csaba, and Zimmer 2016). This function takes in an organism id ("hsa" is for homo sapiens or "mmu" for Mus musculus (mouse)) and a file path where the .gmt file created by the outKegg2Gmt function should be stored. The following code can be used to perform pathway analysis on KEGG pathways. The first command (outKegg2Gmt) writes a .gmt file for homo sapiens in the directory specified; the second command reads this .gmt file, transforms these gene sets, and saves them as the kegg.pathways object; then the last command performs the pathway splice analysis.

outKegg2Gmt("hsa", file.path(dir.name, "kegg.gmt.txt"))

kegg.pathways <- gmtGene2Cat(file.path(dir.name, "kegg.gmt.txt"),
genomeID = "hg19")

result.kegg <- runPathwaySplice(genewise.table = gene.based.table.fdr,
genome = "hg19",
id = "ensGene",
gene2cat = kegg.pathways,
go.size.limit = c(5, 100),
method = "Wallenius",
use.genes.without.cat = TRUE)

# 4. Comparison of results before and after bias correction

To demonstrate the advantage of using the PathwaySplice package, we will illustrate pathway analysis using the entire 23,520 genes in the MDS dataset described in Section 2. Initial analysis was conducted using the JunctionSeq package to obtain $$p$$-values for differential exon usage. Because each gene may contain multiple exon counting bins or gene features (Anders, Reyes, and Huber 2012), the initial analysis can return multiple $$p$$-values associated with gene features belonging to that gene. We then used the makeGeneTable function to represent each gene by the minimum $$p$$-value among all gene features mapped to it. We saved this output in the "AllGeneTable.rds" file, which can be loaded using the following code.

dir <- system.file("extdata", package="PathwaySplice")
all.gene.table <- readRDS(file.path(dir, "AllGeneTable.rds"))

We next compared splicing pathway analysis results before and after adjusting for the number of exon counting bins associated with each gene. This can be accomplished using the compareResults function:

res.adj <- runPathwaySplice(genewise.table = all.gene.table,
genome = "hg19",
id = "ensGene",
test.cats = "GO:BP",
go.size.limit = c(5, 30),
method = "Wallenius")

genome = "hg19",
id = "ensGene",
test.cats = "GO:BP",
go.size.limit = c(5, 30),
method = "Hypergeometric")
compareResults(n.go = 20,
gene.based.table = all.gene.table,
output.dir = tempdir(),
type.boxplot = "Only3")

This function produces several files in the output directory specified by output.dir, including:

• A Venn diagram that compares the overlap of the most significant n.go gene sets using adjusted and unadjusted analysis.
• A boxplot that compares the distribution of the numbers of
• gene features belonging to all genes in the top n.go significant pathways identified by both adjusted and unadjusted analysis (All),
• genes within the significant gene sets uniquely identified using adjusted analysis (Adjusted), and
• genes within the significant gene sets uniquely identified using unadjusted analysis (Unadjusted).
• A .csv file with gene set names corresponding to different sections of the Venn diagram.
The Venn diagram below shows that among the top 20 most significant GO categories identified using adjusted and unadjusted analysis, there were 12 common GO categories.