easierData
The easierData package includes an exemplary cancer dataset from Mariathasan et al. (2018) to showcase the easier package:
Mariathasan2018_PDL1_treatment: exemplary bladder cancer dataset with samples from 192 patients. This is provided as a SummarizedExperiment object containing:
counts and tpm expression values.
sample information in the colData slot, including pat_id (the id of the patient in the original study), BOR, and TMB (Tumor Mutational Burden).
The processed data is publicly available from Mariathasan et al., "TGF-β attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells", published in Nature, 2018 (doi:10.1038/nature25501), via the IMvigor210CoreBiologies package under the CC-BY license.
The easierData data package also includes several data objects, the so-called internal data of the easier package, which are indispensable for the package to function. These are:
opt_models: the cancer-specific model feature parameters learned in Lapuente-Santana et al. (2021). For each quantitative descriptor (e.g. pathway activity), models were trained using multi-task learning with randomized cross-validation repeated 100 times. For each quantitative descriptor, 1000 models are available (100 per task). This is provided as a list containing, for each cancer type and quantitative descriptor, a matrix of feature coefficient values across different tasks.
opt_xtrain_stats: the cancer-specific mean and standard deviation of each feature in the training set of each quantitative descriptor (e.g. pathway activity), as used in Lapuente-Santana et al. (2021) during randomized cross-validation repeated 100 times. These values are required for normalization of the test set. This is provided as a list containing, for each cancer type and quantitative descriptor, a matrix with feature mean and sd values across the 100 cross-validation runs.
TCGA_mean_pancancer: a numeric vector with the mean of the TPM expression of each gene across all TCGA cancer types, required for normalization of input TPM gene expression data.
TCGA_sd_pancancer: a numeric vector with the standard deviation (sd) of the TPM expression of each gene across all TCGA cancer types, required for normalization of input TPM gene expression data.
cor_scores_genes: a character vector with the list of genes used to define correlated scores of immune response. These scores were found to be highly correlated across all 18 cancer types (Lapuente-Santana et al. 2021).
intercell_networks: a list with the cancer-specific intercellular networks, including a pan-cancer network.
lr_frequency_TCGA: a numeric vector containing the frequency of each ligand-receptor pair feature across the whole TCGA database.
group_lrpairs: a list describing how to group ligand-receptor pairs that share the same gene, either as ligand or as receptor.
HGNC_annotation: a data.frame with the approved gene symbol annotations obtained from https://www.genenames.org/tools/multi-symbol-checker/ (Tweedie et al. 2020).
scores_signature_genes: a list with the gene signatures for each score of immune response: CYT (Rooney et al. 2015), TLS (Cabrita et al. 2020), IFNy (McClanahan 2017), Ayers_expIS (McClanahan 2017), Tcell_inflamed (McClanahan 2017), Roh_IS (Roh et al. 2017), Davoli_IS (Davoli et al. 2017), chemokines (Messina et al. 2012), IMPRES (Auslander et al. 2018), MSI (Fu et al. 2019) and RIR (Jerby-Arnon et al. 2018).
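To make the role of TCGA_mean_pancancer and TCGA_sd_pancancer more concrete, the following is a minimal sketch of gene-wise normalization against external reference statistics. It uses small made-up stand-ins (`ref_mean`, `ref_sd`, and a toy `tpm` matrix) rather than the real objects, and it assumes a log2(TPM + 1) transformation before scaling. That transformation is an assumption for illustration and is not confirmed by the text above.

```r
# Toy TPM matrix: genes in rows, samples in columns (values are made up).
tpm <- matrix(
  c(10, 0, 5,
    200, 150, 80,
    3, 7, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("s1", "s2", "s3"))
)

# Hypothetical stand-ins for TCGA_mean_pancancer and TCGA_sd_pancancer:
# named numeric vectors with one entry per gene. GENE_C is deliberately
# absent to show how genes without reference statistics are handled.
ref_mean <- c(GENE_A = 2.5, GENE_B = 7.0, GENE_D = 1.0)
ref_sd   <- c(GENE_A = 1.0, GENE_B = 0.5, GENE_D = 2.0)

# Assumption: expression is log2(TPM + 1) transformed before scaling.
log_tpm <- log2(tpm + 1)

# Keep only genes that have reference statistics.
shared <- intersect(rownames(log_tpm), names(ref_mean))

# Row-wise z-score against the reference: sweep() with MARGIN = 1 applies
# one mean and one sd per gene (row).
centered <- sweep(log_tpm[shared, , drop = FALSE], 1, ref_mean[shared], "-")
z <- sweep(centered, 1, ref_sd[shared], "/")
z
```

Using fixed reference statistics rather than the mean and sd of the input itself keeps the scaling independent of the size and composition of the input cohort.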
After starting R, this package can be installed as follows:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("easierData")
The contents of the package can be seen by querying ExperimentHub for the package name:
suppressPackageStartupMessages({
library("ExperimentHub")
library("easierData")
})
eh <- ExperimentHub()
query(eh, "easierData")
#> ExperimentHub with 11 records
#> # snapshotDate(): 2024-10-24
#> # $dataprovider: NA, IMvigor210CoreBiologies package; Mariathasan S, Turley...
#> # $species: Homo sapiens
#> # $rdataclass: list, numeric, data.frame, character, SummarizedExperiment
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["EH6677"]]'
#>
#> title
#> EH6677 | Mariathasan2018_PDL1_treatment
#> EH6678 | opt_models
#> EH6679 | opt_xtrain_stats
#> EH6680 | TCGA_mean_pancancer
#> EH6681 | TCGA_sd_pancancer
#> ... ...
#> EH6683 | intercell_networks
#> EH6684 | lr_frequency_TCGA
#> EH6685 | group_lrpairs
#> EH6686 | HGNC_annotation
#> EH6687 | scores_signature_genes
An overview is also provided in tabular form:
list_easierData()
#> eh_id title
#> 1 EH6677 Mariathasan2018_PDL1_treatment
#> 2 EH6678 opt_models
#> 3 EH6679 opt_xtrain_stats
#> 4 EH6680 TCGA_mean_pancancer
#> 5 EH6681 TCGA_sd_pancancer
#> 6 EH6682 cor_scores_genes
#> 7 EH6683 intercell_networks
#> 8 EH6684 lr_frequency_TCGA
#> 9 EH6685 group_lrpairs
#> 10 EH6686 HGNC_annotation
#> 11 EH6687 scores_signature_genes
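The table makes it possible to look up an accession number by title instead of hard-coding it. The sketch below assumes that list_easierData() returns a data.frame with the columns eh_id and title, as the printed output suggests. To stay runnable offline it uses a small stand-in (`overview`) with rows copied from that output, and `lookup_eh_id()` is a hypothetical helper, not a function of the package.

```r
# Stand-in for the table returned by list_easierData(), using a few rows
# copied from the output above. With the package loaded, one would use
# overview <- list_easierData() instead (assumed to be a data.frame).
overview <- data.frame(
  eh_id = c("EH6677", "EH6678", "EH6682"),
  title = c("Mariathasan2018_PDL1_treatment", "opt_models", "cor_scores_genes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the accession for a given title, failing
# loudly when the title is unknown or ambiguous.
lookup_eh_id <- function(tab, title) {
  hit <- tab$eh_id[tab$title == title]
  if (length(hit) != 1L) {
    stop("Expected exactly one record titled '", title, "', found ", length(hit))
  }
  hit
}

lookup_eh_id(overview, "opt_models")
#> [1] "EH6678"
```

The returned identifier could then be passed to the hub object, for example eh[[lookup_eh_id(overview, "opt_models")]].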
The individual data objects can be accessed using either their ExperimentHub accession number or the convenience functions provided in this package; both calls are equivalent. For instance, to access the Mariathasan2018_PDL1_treatment example dataset:
mariathasan_dataset <- eh[["EH6677"]]
mariathasan_dataset
#> class: SummarizedExperiment
#> dim: 31087 192
#> metadata(1): cancertype
#> assays(2): counts tpm
#> rownames(31087): A1BG NAT2 ... CASP8AP2 SCO2
#> rowData names(0):
#> colnames(192): SAM7f0d9cc7f001 SAM4305ab968b90 ... SAMda4d892fddc8
#> SAMe3d4266775a9
#> colData names(3): pat_id BOR TMB
mariathasan_dataset <- get_Mariathasan2018_PDL1_treatment()
mariathasan_dataset
#> class: SummarizedExperiment
#> dim: 31087 192
#> metadata(1): cancertype
#> assays(2): counts tpm
#> rownames(31087): A1BG NAT2 ... CASP8AP2 SCO2
#> rowData names(0):
#> colnames(192): SAM7f0d9cc7f001 SAM4305ab968b90 ... SAMda4d892fddc8
#> SAMe3d4266775a9
#> colData names(3): pat_id BOR TMB
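Once retrieved, the contents of the object can be extracted with the standard SummarizedExperiment accessors. The sketch below builds a toy object (`se`) that mirrors the layout shown above (assays counts and tpm, colData columns pat_id, BOR and TMB) so that it runs without downloading anything. All values, including the BOR labels, are made-up placeholders. The same accessor calls should apply to mariathasan_dataset.

```r
library(SummarizedExperiment)

# Toy object mirroring the structure printed above; all values are made up.
counts <- matrix(
  c(10L, 0L, 25L, 4L, 7L, 1L), nrow = 2,
  dimnames = list(c("GENE_A", "GENE_B"), c("s1", "s2", "s3"))
)
tpm <- counts / 2
se <- SummarizedExperiment(
  assays = list(counts = counts, tpm = tpm),
  colData = DataFrame(
    pat_id = c("p1", "p2", "p3"),
    BOR = c("CR", "PD", NA),   # placeholder labels, not the real coding
    TMB = c(12, 3, 8),
    row.names = colnames(counts)
  )
)

# Extract one assay by name as a genes x samples matrix.
tpm_mat <- assay(se, "tpm")

# Extract the sample annotation as an ordinary data.frame.
sample_info <- as.data.frame(colData(se))

# colData columns are reachable with `$`; here we keep only the samples
# that have a recorded BOR value.
se_annotated <- se[, !is.na(se$BOR)]
dim(se_annotated)
```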
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.35.5 Biobase_2.65.1
#> [3] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
#> [5] IRanges_2.39.2 S4Vectors_0.43.2
#> [7] MatrixGenerics_1.17.1 matrixStats_1.4.1
#> [9] ExperimentHub_2.13.1 AnnotationHub_3.13.3
#> [11] BiocFileCache_2.13.2 dbplyr_2.5.0
#> [13] BiocGenerics_0.51.3 easierData_1.11.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.45.1 xfun_0.48 bslib_0.8.0
#> [4] lattice_0.22-6 vctrs_0.6.5 tools_4.5.0
#> [7] generics_0.1.3 curl_5.2.3 tibble_3.2.1
#> [10] fansi_1.0.6 AnnotationDbi_1.67.0 RSQLite_2.3.7
#> [13] blob_1.2.4 pkgconfig_2.0.3 Matrix_1.7-1
#> [16] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.5.0
#> [19] Biostrings_2.73.2 htmltools_0.5.8.1 sass_0.4.9
#> [22] yaml_2.3.10 pillar_1.9.0 crayon_1.5.3
#> [25] jquerylib_0.1.4 DelayedArray_0.31.14 cachem_1.1.0
#> [28] abind_1.4-8 mime_0.12 tidyselect_1.2.1
#> [31] digest_0.6.37 purrr_1.0.2 dplyr_1.1.4
#> [34] BiocVersion_3.20.0 grid_4.5.0 fastmap_1.2.0
#> [37] SparseArray_1.5.45 cli_3.6.3 magrittr_2.0.3
#> [40] S4Arrays_1.5.11 utf8_1.2.4 withr_3.0.1
#> [43] filelock_1.0.3 UCSC.utils_1.1.0 rappdirs_0.3.3
#> [46] bit64_4.5.2 rmarkdown_2.28 XVector_0.45.0
#> [49] httr_1.4.7 bit_4.5.0 png_0.1-8
#> [52] memoise_2.0.1 evaluate_1.0.1 knitr_1.48
#> [55] rlang_1.1.4 glue_1.8.0 DBI_1.2.3
#> [58] BiocManager_1.30.25 jsonlite_1.8.9 R6_2.5.1
#> [61] zlibbioc_1.51.2