Besides being used as a QC-tool for gene expression data, BioQC can be used for general-purpose gene-set enrichment analysis. Its core function, wmwTest, can handle not only “normal”, unsigned gene sets, but also signed genesets where two sets of genes represent signatures that are positively and negatively regulated respectively.
This vignette describes an example of such applications. First we load the BioQC library.
We first open a toy GMT file containing signed genesets with readSignedGmt. The function reads the file into a SignedGenesets object.
gmtFile <- system.file("extdata/test.gmt", package="BioQC")
## print the file context
cat(readLines(gmtFile), sep="\n")
## GS_A TestGeneSet_A AKT1 AKT2 AKT3
## GS_B TestGeneSet_B MAPK1 MAPK3 MAPK8
## GS_C_UP TestGeneSet_C ERBB2 ERBB3
## GS_C_DN TestGeneSet_C EGFR ERBB4
## GS_D_UP TestGeneSet_D GATA2 GATA4
## GS_D_DN TestGeneSet_D GATA1 GATA3
## GS_E_DN TestGeneSet_E TSC1 TSC2
## A list of 5 signed gene-sets
## GS_A (no genes)
## GS_B (no genes)
## GS_C[n=4]
## positive[n=2]:ERBB2,ERBB3
## negative[n=2]:EGFR,ERBB4
## GS_D[n=4]
## positive[n=2]:GATA2,GATA4
## negative[n=2]:GATA1,GATA3
## GS_E[n=2]
## positive[n=0]:
## negative[n=2]:TSC1,TSC2
Note that though the GMT file contains 6 non-empty lines, only five signed genesets are returned by readSignedGmt because a pair of negative and negative geneset of GS_C is available.
Note that in its current form, no genes are reported for GS_A and GS_B, because the names of both sets do not match the patterns. User can decide how to handle such genesets that match patterns of neither positive nor negative sets (GS_A and GS_B in this case). By default, such genesets are ignored; however, user can decide to treat them as either positive or negative genesets. For the purpose of this tutorial, we treat these genesets as positive.
## A list of 5 signed gene-sets
## GS_A[n=3]
## positive[n=3]:AKT1,AKT2,AKT3
## negative[n=0]:
## GS_B[n=3]
## positive[n=3]:MAPK1,MAPK3,MAPK8
## negative[n=0]:
## GS_C[n=4]
## positive[n=2]:ERBB2,ERBB3
## negative[n=2]:EGFR,ERBB4
## GS_D[n=4]
## positive[n=2]:GATA2,GATA4
## negative[n=2]:GATA1,GATA3
## GS_E[n=2]
## positive[n=0]:
## negative[n=2]:TSC1,TSC2
Next we construct an ExpressionSet object containing 100 genes, with some of the genes in the GMT file present.
set.seed(1887)
testN <- 100L
testSampleCount <- 3L
testGenes <- c("AKT1", "AKT2", "ERBB2", "ERBB3", "EGFR","TSC1", "TSC2", "GATA2", "GATA4", "GATA1", "GATA3")
testRows <- c(testGenes, paste("Gene", (length(testGenes)+1):testN, sep=""))
testMatrix <- matrix(rnorm(testN*testSampleCount, sd=0.1), nrow=testN, dimnames=list(testRows, NULL))
testMatrix[1:2,] <- testMatrix[1:2,]+10
testMatrix[6:7,] <- testMatrix[6:7,]-10
testMatrix[3:4,] <- testMatrix[3:4,]-5
testMatrix[5,] <- testMatrix[5,]+5
testEset <- new("ExpressionSet", exprs=testMatrix)
The expression of somes genes are delibrately changed so that some genesets are expected to be significantly enriched.
Considering the geneset composition printed above, we can expect that * GS_A shall be significantly up-regulated; * GS_B shall not be significant, because its genes are missing; * GS_C shall be significantly down-regulated, with positive targets down-regulated and negative targets up-regulated; * GS_D shall not always be significant, because its genes are comparable to the background; * GS_E shall be significantly up-regulated, because its negative targets are down-regulated.
With both SignedGenesets and ExpressionSet objects at hand, we can query the indices of genes of genesets in the ExpressionSet object.
## A list of 5 signed indices with offset=1
## Options: NA removed: TRUE; duplicates removed: TRUE
## GS_A[n=2]
## positive[n=2]:1,2
## negative[n=0]:
## GS_B (no genes)
## GS_C[n=3]
## positive[n=2]:3,4
## negative[n=1]:5
## GS_D[n=4]
## positive[n=2]:8,9
## negative[n=2]:10,11
## GS_E[n=2]
## positive[n=0]:
## negative[n=2]:6,7
If the ExpressionSet object has GeneSymbols in its featureData other than the feature names, user can specify the parameter col in matchGenes to make the match working. matchGene also works with characters and matrix as input, please check out the help pages for such uses.
## 1 2 3
## GS_A 0.008185751 0.008185751 0.008185751
## GS_B 1.000000000 1.000000000 1.000000000
## GS_C 0.997942914 0.997942914 0.997942914
## GS_D 0.924153922 0.114118870 0.211756278
## GS_E 0.008185751 0.008185751 0.008185751
As expected, GS_A and GS_E are significant when the alternative hypothesis is greater.
## 1 2 3
## GS_A 0.992348913 0.992348913 0.992348913
## GS_B 1.000000000 1.000000000 1.000000000
## GS_C 0.002192387 0.002192387 0.002192387
## GS_D 0.078389208 0.889240837 0.793302033
## GS_E 0.992348913 0.992348913 0.992348913
As expected, GS_C is significant when the alternative hypothesis is less.
wmwResult.two.sided <- wmwTest(testEset, testIndex, valType="p.two.sided")
print(wmwResult.two.sided)
## 1 2 3
## GS_A 0.016371502 0.016371502 0.016371502
## GS_B 1.000000000 1.000000000 1.000000000
## GS_C 0.004384774 0.004384774 0.004384774
## GS_D 0.156778415 0.228237741 0.423512555
## GS_E 0.016371502 0.016371502 0.016371502
As expected, GS_A, GS_C, and GS_E are significant when the alternative hypothesis is two.sided.
A simple way to integrate the results of both alternative hypotheses into one numerical value is the Q value, which is defined as absolute log10 transformation of the smaller p value of the two hypotheses, times the sign defined by greater=1 and less=-1.
## 1 2 3
## GS_A 1.7859115 1.7859115 1.7859115
## GS_B 0.0000000 0.0000000 0.0000000
## GS_C -2.3580528 -2.3580528 -2.3580528
## GS_D -0.8047137 0.6416125 0.3731337
## GS_E 1.7859115 1.7859115 1.7859115
As expected, genesets GS_A, GS_C, and GS_E have large absolute values, suggesting they are potentially enriched; while GS_A and GS_E are positively enriched, GS_C is negatively enriched.
In summary, BioQC can be used to perform gene-set enrichment analysis with signed genesets. Its speed is comparable to that of unsigned genesets.
I would like to thank the authors of the gCMAP package with the idea of integrating signed genesets into the framework of Wilcoxon-Mann-Whitney test.
## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rbenchmark_1.0.0 gplots_3.1.3.1 gridExtra_2.3
## [4] latticeExtra_0.6-30 lattice_0.22-6 org.Hs.eg.db_3.19.1
## [7] AnnotationDbi_1.67.0 IRanges_2.39.0 S4Vectors_0.43.0
## [10] testthat_3.2.1.1 limma_3.61.0 RColorBrewer_1.1-3
## [13] BioQC_1.33.0 Biobase_2.65.0 BiocGenerics_0.51.0
## [16] knitr_1.46
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.45.0 gtable_0.3.5 xfun_0.43
## [4] bslib_0.7.0 caTools_1.18.2 vctrs_0.6.5
## [7] tools_4.4.0 bitops_1.0-7 RSQLite_2.3.6
## [10] highr_0.10 blob_1.2.4 pkgconfig_2.0.3
## [13] KernSmooth_2.23-22 desc_1.4.3 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.12 compiler_4.4.0 deldir_2.0-4
## [19] Biostrings_2.73.0 brio_1.1.5 statmod_1.5.0
## [22] GenomeInfoDb_1.41.0 htmltools_0.5.8.1 sass_0.4.9
## [25] yaml_2.3.8 crayon_1.5.2 jquerylib_0.1.4
## [28] cachem_1.0.8 gtools_3.9.5 locfit_1.5-9.9
## [31] digest_0.6.35 rprojroot_2.0.4 fastmap_1.1.1
## [34] grid_4.4.0 cli_3.6.2 magrittr_2.0.3
## [37] edgeR_4.3.0 UCSC.utils_1.1.0 bit64_4.0.5
## [40] rmarkdown_2.26 XVector_0.45.0 httr_1.4.7
## [43] jpeg_0.1-10 bit_4.0.5 interp_1.1-6
## [46] png_0.1-8 memoise_2.0.1 evaluate_0.23
## [49] rlang_1.1.3 Rcpp_1.0.12 glue_1.7.0
## [52] DBI_1.2.2 pkgload_1.3.4 jsonlite_1.8.8
## [55] R6_2.5.1 zlibbioc_1.51.0