The phantasusLite
package contains a set functions that facilitate working with public
gene expression datasets originally developed for phantasus package. Unlike phantasus
, phantasusLite
aims to limit the amount of dependencies.
The current functionality includes:
It is recommended to install the release version of the package from Bioconductor using the following commands:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phantasusLite")
Alternatively, the most recent version of the package can be installed from the GitHub repository:
library(devtools)
install_github("ctlab/phantasusLite")
library(GEOquery)
library(phantasusLite)
Let’s load dataset GSE53053 from GEO using GEOquery
package:
ess <- getGEO("GSE53053")
es <- ess[[1]]
RNA-seq dataset from GEO do not contain the expression matrix, thus exprs(es)
is empty:
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
## GSM1281306 GSM1281307
However, a number of precomputed gene count tables are available at HSDS server ‘https://alserglab.wustl.edu/hsds/’. It features HDF5 files with counts from ARCHS4 and DEE2 projects:
url <- 'https://alserglab.wustl.edu/hsds/?domain=/counts'
getHSDSFileList(url)
## [1] "/counts/archs4/Arabidopsis_thaliana_count_matrix.h5"
## [2] "/counts/archs4/Bos_taurus_count_matrix.h5"
## [3] "/counts/archs4/Caenorhabditis_elegans_count_matrix.h5"
## [4] "/counts/archs4/Danio_rerio_count_matrix.h5"
## [5] "/counts/archs4/Drosophila_melanogaster_count_matrix.h5"
## [6] "/counts/archs4/Gallus_gallus_count_matrix.h5"
## [7] "/counts/archs4/Rattus_norvegicus_count_matrix.h5"
## [8] "/counts/archs4/Saccharomyces_cerevisiae_count_matrix.h5"
## [9] "/counts/archs4/human_gene_v2.3.h5"
## [10] "/counts/archs4/meta.h5"
## [11] "/counts/archs4/mouse_gene_v2.3.h5"
## [12] "/counts/dee2/athaliana_star_matrix_20240409.h5"
## [13] "/counts/dee2/celegans_star_matrix_20240409.h5"
## [14] "/counts/dee2/dmelanogaster_star_matrix_20240409.h5"
## [15] "/counts/dee2/drerio_star_matrix_20240409.h5"
## [16] "/counts/dee2/ecoli_star_matrix_20240409.h5"
## [17] "/counts/dee2/hsapiens_star_matrix_20240409.h5"
## [18] "/counts/dee2/meta.h5"
## [19] "/counts/dee2/mmusculus_star_matrix_20240409.h5"
## [20] "/counts/dee2/rnorvegicus_star_matrix_20240409.h5"
## [21] "/counts/dee2/scerevisiae_star_matrix_20240409.h5"
## [22] "/counts/index.h5"
## [23] "/counts/priority.h5"
GSE53053 dataset was sequenced from Mus musculus and we can get an expression matrix from the corresponding HDF5-file with DEE2 data:
file <- "dee2/mmusculus_star_matrix_20240409.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000102693 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0
## ENSMUSG00000051951 0 0 135 120 132
## ENSMUSG00000102851 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 1
## ENSMUSG00000104017 0 0 0 0 0
## GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
Normally loadCountsFromHSDS
can be used to automatically select the HDF5-file with the largest
number of quantified samples:
es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000000001 1015 603 561 549 425
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 109 34 0 14 9
## ENSMUSG00000000031 0 18 0 0 0
## ENSMUSG00000000037 0 0 0 0 0
## ENSMUSG00000000049 0 0 0 0 0
## GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000000001 853 407 479
## ENSMUSG00000000003 0 0 0
## ENSMUSG00000000028 165 0 15
## ENSMUSG00000000031 0 0 0
## ENSMUSG00000000037 0 0 0
## ENSMUSG00000000049 0 0 0
The counts are different from the previous values as ARCHS4 counts were used – ARCHS4 is prioritized when there are several files with the same number of samples:
preproc(experimentData(es))$gene_counts_source
## [1] "archs4/mouse_gene_v2.3.h5"
Further, gene symbols are also imported from ARCHS4 database and are available as feature data:
head(fData(es))
## Gene symbol ENSEMBLID
## ENSMUSG00000000001 Gnai3 ENSMUSG00000000001
## ENSMUSG00000000003 Pbsn ENSMUSG00000000003
## ENSMUSG00000000028 Cdc45 ENSMUSG00000000028
## ENSMUSG00000000031 H19 ENSMUSG00000000031
## ENSMUSG00000000037 Scml2 ENSMUSG00000000037
## ENSMUSG00000000049 Apoh ENSMUSG00000000049
For some of the GEO datasets, such as GSE53053, the sample annotation is not fully available. However, frequently sample titles are structured in a way that allows to infer the groups. For example, for GSE53053 we can see there are three groups: Ctrl, MandIL4, MandLPSandIFNg, with up to 3 replicates:
es$title
## [1] "Ctrl_1" "Ctrl_2" "MandIL4_1" "MandIL4_2"
## [5] "MandIL4_3" "MandLPSandIFNg_1" "MandLPSandIFNg_2" "MandLPSandIFNg_3"
For such well-structured titles, inferCondition
function can be used to automatically
identify the sample conditions and replicates:
es <- inferCondition(es)
print(es$condition)
## [1] "Ctrl" "Ctrl" "MandIL4" "MandIL4"
## [5] "MandIL4" "MandLPSandIFNg" "MandLPSandIFNg" "MandLPSandIFNg"
print(es$replicate)
## [1] "1" "2" "1" "2" "3" "1" "2" "3"
GCT text format can be used to store annotated gene expression matrices and load them in software such as Morpheus or Phantasus.
For example, we can save the ExpressionSet
object that we defined previously:
f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)
And the load the file back:
es2 <- readGct(f)
## Warning in readGct(f): duplicated row IDs: Gm16364 Pakap Pakap A530058N18Rik
## Snora43 1700030C10Rik; they were renamed
print(es2)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 53511 features, 8 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: Ctrl_1 Ctrl_2 ... MandLPSandIFNg_3 (8 total)
## varLabels: title geo_accession ... replicate (43 total)
## varMetadata: labelDescription
## featureData
## featureNames: Gnai3 Pbsn ... ENSMUSG00002076992 (53511 total)
## fvarLabels: Gene symbol ENSEMBLID
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phantasusLite_1.3.2 GEOquery_2.73.1 Biobase_2.65.0
## [4] BiocGenerics_0.51.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] tidyr_1.3.1 SparseArray_1.5.25 xml2_1.3.6
## [7] stringi_1.8.4 lattice_0.22-6 hms_1.1.3
## [10] digest_0.6.36 magrittr_2.0.3 evaluate_0.24.0
## [13] grid_4.4.1 bookdown_0.40 fastmap_1.2.0
## [16] R.oo_1.26.0 jsonlite_1.8.8 Matrix_1.7-0
## [19] R.utils_2.12.3 limma_3.61.5 BiocManager_1.30.23
## [22] httr_1.4.7 purrr_1.0.2 fansi_1.0.6
## [25] jquerylib_0.1.4 abind_1.4-5 cli_3.6.3
## [28] crayon_1.5.3 rlang_1.1.4 XVector_0.45.0
## [31] R.methodsS3_1.8.2 withr_3.0.0 cachem_1.1.0
## [34] DelayedArray_0.31.9 yaml_2.3.9 S4Arrays_1.5.5
## [37] tools_4.4.1 tzdb_0.4.0 dplyr_1.1.4
## [40] curl_5.2.1 vctrs_0.6.5 R6_2.5.1
## [43] matrixStats_1.3.0 stats4_4.4.1 lifecycle_1.0.4
## [46] stringr_1.5.1 zlibbioc_1.51.1 S4Vectors_0.43.2
## [49] IRanges_2.39.2 pkgconfig_2.0.3 pillar_1.9.0
## [52] bslib_0.7.0 data.table_1.15.4 glue_1.7.0
## [55] statmod_1.5.0 xfun_0.46 tibble_3.2.1
## [58] tidyselect_1.2.1 MatrixGenerics_1.17.0 knitr_1.48
## [61] rjson_0.2.21 htmltools_0.5.8.1 rmarkdown_2.27
## [64] rhdf5client_1.27.1 readr_2.1.5 compiler_4.4.1
```