easierData
The easierData package includes an exemplary cancer dataset from Mariathasan et al. (2018) to showcase the easier package:
Mariathasan2018_PDL1_treatment: exemplary bladder cancer dataset with samples from 192 patients. This is provided as a SummarizedExperiment object containing:
counts and tpm expression values.
colData slot, including pat_id (the id of the patient in the original study), BOR, and TMB (Tumor Mutational Burden).
The processed data is publicly available from Mariathasan et al. “TGF-B attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells”, published in Nature, 2018 doi:10.1038/nature25501 via IMvigor210CoreBiologies package under the CC-BY license.
The easierData data package also includes several data objects, the so-called internal data of the easier package, which are indispensable for the package to function. These include:
opt_models: the cancer-specific model feature parameters learned in Lapuente-Santana et al. (2021). For each quantitative descriptor (e.g. pathway activity), models were trained using multi-task learning with randomized cross-validation repeated 100 times. For each quantitative descriptor, 1000 models are available (100 per task). This is provided as a list containing, for each cancer type and quantitative descriptor, a matrix of feature coefficient values across different tasks.
opt_xtrain_stats: the cancer-specific mean and standard deviation of the features of each quantitative descriptor (e.g. pathway activity) in the training sets used in Lapuente-Santana et al. (2021) during randomized cross-validation repeated 100 times, required for normalization of the test set. This is provided as a list containing, for each cancer type and quantitative descriptor, a matrix with feature mean and sd values across the 100 cross-validation runs.
TCGA_mean_pancancer: a numeric vector with the mean of the TPM expression of each gene across all TCGA cancer types, required for normalization of input TPM gene expression data.
TCGA_sd_pancancer: a numeric vector with the standard deviation (sd) of the TPM expression of each gene across all TCGA cancer types, required for normalization of input TPM gene expression data.
cor_scores_genes: a character vector with the list of genes used to define correlated scores of immune response. These scores were found to be highly correlated across all 18 cancer types (Lapuente-Santana et al. 2021).
intercell_networks: a list with the cancer-specific intercellular networks, including a pan-cancer network.
lr_frequency_TCGA: a numeric vector containing the frequency of each ligand-receptor pair feature across the whole TCGA database.
group_lrpairs: a list with the information on how to group ligand-receptor pairs that share the same gene, either as ligand or receptor.
HGNC_annotation: a data.frame with the approved gene symbol annotations obtained from https://www.genenames.org/tools/multi-symbol-checker/ (Tweedie et al. 2020).
scores_signature_genes: a list with the gene signatures for each score of immune response: CYT (Rooney et al. 2015), TLS (Cabrita et al. 2020), IFNy (McClanahan 2017), Ayers_expIS (McClanahan 2017), Tcell_inflamed (McClanahan 2017), Roh_IS (Roh et al. 2017), Davoli_IS (Davoli et al. 2017), chemokines (Messina et al. 2012), IMPRES (Auslander et al. 2018), MSI (Fu et al. 2019) and RIR (Jerby-Arnon et al. 2018).
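The role of TCGA_mean_pancancer and TCGA_sd_pancancer described above can be illustrated with a minimal sketch. It assumes that normalization means a per-gene z-score of log2(TPM + 1) values against the reference mean and sd; that transform, the helper name normalize_to_reference, and all numbers below are assumptions made for illustration and are not taken from the package.

```r
# Toy stand-ins for TCGA_mean_pancancer and TCGA_sd_pancancer (values invented).
tcga_mean <- c(GENE_A = 3.0, GENE_B = 5.0, GENE_C = 1.0)
tcga_sd   <- c(GENE_A = 1.0, GENE_B = 2.0, GENE_C = 0.5)

# Toy TPM matrix: genes in rows, samples in columns (values invented).
tpm <- matrix(
  c(7, 31, 1,
    15, 127, 3),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("sample1", "sample2"))
)

# Hypothetical helper: z-score each gene against reference statistics.
# The log2(TPM + 1) transform is an assumption of this sketch.
normalize_to_reference <- function(tpm, ref_mean, ref_sd) {
  # Keep only genes present in the data and in both reference vectors.
  genes <- Reduce(intersect, list(rownames(tpm), names(ref_mean), names(ref_sd)))
  log_tpm <- log2(tpm[genes, , drop = FALSE] + 1)
  # sweep() applies the per-gene (row-wise) centering, then scaling.
  centered <- sweep(log_tpm, 1, ref_mean[genes], "-")
  sweep(centered, 1, ref_sd[genes], "/")
}

z <- normalize_to_reference(tpm, tcga_mean, tcga_sd)
z
```

With the real objects, the two named numeric vectors would take the place of tcga_mean and tcga_sd, provided their names are gene symbols matching the row names of the expression matrix.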
From within R, the package can be installed as follows:
BiocManager::install("easierData")
The contents of the package can be seen by querying ExperimentHub for the package name:
suppressPackageStartupMessages({
library("ExperimentHub")
library("easierData")
})
eh <- ExperimentHub()
query(eh, "easierData")
#> ExperimentHub with 11 records
#> # snapshotDate(): 2024-10-24
#> # $dataprovider: NA, IMvigor210CoreBiologies package; Mariathasan S, Turley...
#> # $species: Homo sapiens
#> # $rdataclass: list, numeric, data.frame, character, SummarizedExperiment
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["EH6677"]]'
#>
#> title
#> EH6677 | Mariathasan2018_PDL1_treatment
#> EH6678 | opt_models
#> EH6679 | opt_xtrain_stats
#> EH6680 | TCGA_mean_pancancer
#> EH6681 | TCGA_sd_pancancer
#> ... ...
#> EH6683 | intercell_networks
#> EH6684 | lr_frequency_TCGA
#> EH6685 | group_lrpairs
#> EH6686 | HGNC_annotation
#> EH6687 | scores_signature_genes
An overview is also provided in tabular form:
list_easierData()
#> eh_id title
#> 1 EH6677 Mariathasan2018_PDL1_treatment
#> 2 EH6678 opt_models
#> 3 EH6679 opt_xtrain_stats
#> 4 EH6680 TCGA_mean_pancancer
#> 5 EH6681 TCGA_sd_pancancer
#> 6 EH6682 cor_scores_genes
#> 7 EH6683 intercell_networks
#> 8 EH6684 lr_frequency_TCGA
#> 9 EH6685 group_lrpairs
#> 10 EH6686 HGNC_annotation
#> 11 EH6687 scores_signature_genes
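The tabular overview can be used to look up an accession number from a title. The sketch below works on a hand-written data.frame with the same two columns (eh_id, title) and three of the rows printed above, so that it runs without a network connection; find_eh_id is a hypothetical helper, and whether list_easierData() returns exactly this kind of object is an assumption.

```r
# Hand-written copy of three rows from the overview above.
listing <- data.frame(
  eh_id = c("EH6677", "EH6678", "EH6679"),
  title = c("Mariathasan2018_PDL1_treatment", "opt_models", "opt_xtrain_stats"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the accession number for exactly one title.
find_eh_id <- function(listing, title) {
  hit <- listing$eh_id[listing$title == title]
  if (length(hit) != 1L) {
    stop("expected exactly one record titled '", title, "', found ", length(hit))
  }
  hit
}

opt_models_id <- find_eh_id(listing, "opt_models")
opt_models_id
```

The returned identifier is a plain character string and could then be used to index the ExperimentHub object.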
The individual data objects can be accessed using either their ExperimentHub accession number or the convenience functions provided in this package; both calls are equivalent. For instance, to access the Mariathasan2018_PDL1_treatment example dataset:
mariathasan_dataset <- eh[["EH6677"]]
mariathasan_dataset
#> class: SummarizedExperiment
#> dim: 31087 192
#> metadata(1): cancertype
#> assays(2): counts tpm
#> rownames(31087): A1BG NAT2 ... CASP8AP2 SCO2
#> rowData names(0):
#> colnames(192): SAM7f0d9cc7f001 SAM4305ab968b90 ... SAMda4d892fddc8
#> SAMe3d4266775a9
#> colData names(3): pat_id BOR TMB
mariathasan_dataset <- get_Mariathasan2018_PDL1_treatment()
mariathasan_dataset
#> class: SummarizedExperiment
#> dim: 31087 192
#> metadata(1): cancertype
#> assays(2): counts tpm
#> rownames(31087): A1BG NAT2 ... CASP8AP2 SCO2
#> rowData names(0):
#> colnames(192): SAM7f0d9cc7f001 SAM4305ab968b90 ... SAMda4d892fddc8
#> SAMe3d4266775a9
#> colData names(3): pat_id BOR TMB
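Once retrieved, an object of this kind can be explored with the standard SummarizedExperiment accessors. The following sketch builds a small toy object with the same assay names (counts, tpm) and colData columns (pat_id, BOR, TMB) as shown above, so it runs without downloading anything; all values, including the BOR labels, are invented for illustration.

```r
library(SummarizedExperiment)

# Toy object mirroring the structure printed above. All values are invented.
counts <- matrix(
  c(10L, 0L, 5L, 20L, 3L, 8L),
  nrow = 2,
  dimnames = list(c("GENE_A", "GENE_B"), c("S1", "S2", "S3"))
)
tpm <- counts / 2  # placeholder values, not a real TPM computation

toy_se <- SummarizedExperiment(
  assays = list(counts = counts, tpm = tpm),
  colData = DataFrame(
    pat_id = c("p1", "p2", "p3"),
    BOR = c("CR", "PD", "PD"),  # invented labels
    TMB = c(12, 3, 7),
    row.names = c("S1", "S2", "S3")
  )
)

# Extract one assay as a genes-by-samples matrix.
tpm_mat <- assay(toy_se, "tpm")

# Sample-level annotations as an ordinary data.frame.
sample_info <- as.data.frame(colData(toy_se))

# Tabulate the response labels.
bor_counts <- table(sample_info$BOR)
bor_counts
```

The same accessor calls should apply unchanged to mariathasan_dataset, since it is also a SummarizedExperiment with these assay and colData names.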
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [3] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [5] IRanges_2.41.0 S4Vectors_0.45.0
#> [7] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [9] ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [11] BiocFileCache_2.15.0 dbplyr_2.5.0
#> [13] BiocGenerics_0.53.1 generics_0.1.3
#> [15] easierData_1.13.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 xfun_0.49 bslib_0.8.0
#> [4] lattice_0.22-6 vctrs_0.6.5 tools_4.5.0
#> [7] curl_5.2.3 tibble_3.2.1 fansi_1.0.6
#> [10] AnnotationDbi_1.69.0 RSQLite_2.3.7 blob_1.2.4
#> [13] pkgconfig_2.0.3 Matrix_1.7-1 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.13 compiler_4.5.0 Biostrings_2.75.0
#> [19] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
#> [22] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
#> [25] DelayedArray_0.33.1 cachem_1.1.0 abind_1.4-8
#> [28] mime_0.12 tidyselect_1.2.1 digest_0.6.37
#> [31] purrr_1.0.2 dplyr_1.1.4 BiocVersion_3.21.1
#> [34] grid_4.5.0 fastmap_1.2.0 SparseArray_1.7.0
#> [37] cli_3.6.3 magrittr_2.0.3 S4Arrays_1.7.1
#> [40] utf8_1.2.4 withr_3.0.2 filelock_1.0.3
#> [43] UCSC.utils_1.3.0 rappdirs_0.3.3 bit64_4.5.2
#> [46] rmarkdown_2.29 XVector_0.47.0 httr_1.4.7
#> [49] bit_4.5.0 png_0.1-8 memoise_2.0.1
#> [52] evaluate_1.0.1 knitr_1.48 rlang_1.1.4
#> [55] glue_1.8.0 DBI_1.2.3 BiocManager_1.30.25
#> [58] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.53.0