
Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database. If your focus lies in cancer cell lines, you can access data from cancercelllines.org by specifying the dataset parameter as “cancercelllines”. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.

1 Load library

library(pgxRpi)

1.1 pgxLoader function

This function loads various types of data from the Progenetix database.

The parameters of this function used in this tutorial:

  • type: A string specifying the output data type. Available options are “biosample”, “individual”, “variant” or “frequency”.
  • filters: Identifiers for cancer type, literature, cohorts, and age, such as c("NCIT:C7376", "pgx:icdom-98353", "PMID:22824167", "pgx:cohort-TCGAcancers", "age:>=P50Y"). For more information about filters, see the documentation.
  • filterLogic: A string specifying the logic for combining multiple filters when querying metadata. Available options are “AND” and “OR”. Default is “AND”. Age filters are an exception: they are always combined with any other filter using AND logic, even if filterLogic = "OR", in which case OR applies only to the other filters.
  • individual_id: Identifiers used in the Progenetix database for identifying individuals.
  • biosample_id: Identifiers used in the Progenetix database for identifying biosamples.
  • codematches: A logical value determining whether to exclude samples from child concepts of the specified filters that belong to a cancer type/tissue encoding system (NCIt, icdom/t, Uberon). If TRUE, only samples encoded exactly by the specified filters are returned. Do not use this parameter when the filters include identifiers unrelated to an ontology, such as PMID and cohort identifiers. Default is FALSE.
  • limit: Integer specifying the number of samples/individuals/coverage profiles returned for each filter. Default is 0 (return all).
  • skip: Integer specifying how many blocks of limit samples/individuals/coverage profiles to skip for each filter. For example, with skip = 2 and limit = 500, the first 2 * 500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
  • dataset: A string specifying the dataset to query. Default is “progenetix”. The other available option is “cancercelllines”.
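
The skip and limit arithmetic from the parameter list can be checked with a small helper. This is only a sketch in base R: page_range is a hypothetical name introduced here for illustration and is not part of pgxRpi. It assumes that skip counts whole blocks of size limit, as the example for the skip parameter suggests.

```r
# Hypothetical helper (not part of pgxRpi): positions of the first and last
# record selected by a skip/limit pair, assuming skip counts blocks of
# size `limit`.
page_range <- function(skip, limit) {
  stopifnot(limit > 0, skip >= 0)
  c(first = skip * limit + 1, last = (skip + 1) * limit)
}

# skip = 2, limit = 500: records 1001 to 1500 are returned
page_range(skip = 2, limit = 500)
```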

2 Retrieve metadata of samples

2.1 Relevant parameters

type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset

2.2 Search by filters

Filters are a significant enhancement to the Beacon query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the documentation.

The pgxFilter function lists the filters available in Progenetix. For example:

# access all filters
all_filters <- pgxFilter()
# get all prefixes
all_prefix <- pgxFilter(return_all_prefix = TRUE)
# access specific filters based on prefix
ncit_filters <- pgxFilter(prefix="NCIT")
head(ncit_filters)
#> [1] "NCIT:C28076" "NCIT:C18000" "NCIT:C14158" "NCIT:C14161" "NCIT:C28077"
#> [6] "NCIT:C28078"

The following query retrieves metadata for all lung adenocarcinoma samples in Progenetix, using the NCIt code NCIT:C3512 as an ontology-based filter.

biosamples <- pgxLoader(type="biosample", filters = "NCIT:C3512")
# data looks like this
biosamples[c(1700:1705),]
#>        biosample_id biosample_label biosample_legacy_id   individual_id
#> 1700 pgxbs-kftvj3y5              NA                  NA pgxind-kftx4x5b
#> 1701 pgxbs-kftvkvef              NA                  NA pgxind-kftx7204
#> 1702 pgxbs-kftvkwfd              NA                  NA pgxind-kftx73ah
#> 1703 pgxbs-kftvl99a              NA                  NA pgxind-kftx7jcp
#> 1704 pgxbs-kftvl9tn              NA                  NA pgxind-kftx7k2d
#> 1705 pgxbs-kftvj6tn              NA                  NA pgxind-kftx50oe
#>         callset_ids group_id group_label     pubmed_id
#> 1700 pgxcs-kftweo97       NA          NA PMID:20215515
#> 1701 pgxcs-kftwuxdr       NA          NA PMID:28336552
#> 1702 pgxcs-kftwv8r2       NA          NA              
#> 1703 pgxcs-kftx0s1n       NA          NA PMID:28481359
#> 1704 pgxcs-kftx0yfl       NA          NA PMID:28481359
#> 1705 pgxcs-kftwfk6p       NA          NA PMID:21521776
#>                                                                                          pubmed_label
#> 1700     Rothenberg SM, Mohapatra G et al. (2010): A genome-wide screen for microdeletions reveals...
#> 1701 Jordan EJ, Kim HR et al. (2017): Prospective Comprehensive Molecular Characterization of Lung...
#> 1702                                                                                                 
#> 1703          Zehir A, Benayed R et al. (2017): Mutational landscape of metastatic cancer revealed...
#> 1704          Zehir A, Benayed R et al. (2017): Mutational landscape of metastatic cancer revealed...
#> 1705             Broët P, Dalmasso C et al. (2011): Genomic profiles specific to patient ethnicity...
#>             cellosaurus_id cellosaurus_label              cbioportal_id
#> 1700 cellosaurus:CVCL_1475         NCI-H1563                           
#> 1701                                           cbioportal:lung_msk_2017
#> 1702                                            cbioportal:lung_msk_pdx
#> 1703                                         cbioportal:msk_impact_2017
#> 1704                                         cbioportal:msk_impact_2017
#> 1705                                                                   
#>      cbioportal_label tcgaproject_id tcgaproject_label
#> 1700               NA                                 
#> 1701               NA                                 
#> 1702               NA                                 
#> 1703               NA                                 
#> 1704               NA                                 
#> 1705               NA                                 
#>      external_references_id___arrayexpress
#> 1700                                      
#> 1701                                      
#> 1702                                      
#> 1703                                      
#> 1704                                      
#> 1705                                      
#>      external_references_label___arrayexpress cohort_ids
#> 1700                                                  NA
#> 1701                                                  NA
#> 1702                                                  NA
#> 1703                                                  NA
#> 1704                                                  NA
#> 1705                                                  NA
#>                                       legacy_ids
#> 1700                         PGX_AM_BS_GSM827536
#> 1701   PGX_AM_BS_LUNG_MSK_2017-P_0002091_T02_IM5
#> 1702    PGX_AM_BS_LUNG_MSK_PDX-P_0012001_T01_IM5
#> 1703 PGX_AM_BS_MSK_IMPACT_2017-P_0011030_T01_IM5
#> 1704 PGX_AM_BS_MSK_IMPACT_2017-P_0011548_T01_IM5
#> 1705                         PGX_AM_BS_GSM837804
#>                                          notes histological_diagnosis_id
#> 1700 Lung adenocarcinoma [cell line NCI-H1563]                NCIT:C3512
#> 1701                       Lung Adenocarcinoma                NCIT:C3512
#> 1702                       Lung Adenocarcinoma                NCIT:C3512
#> 1703                       Lung Adenocarcinoma                NCIT:C3512
#> 1704                       Lung Adenocarcinoma                NCIT:C3512
#> 1705    lung adenocarcinoma [Western European]                NCIT:C3512
#>      histological_diagnosis_label icdo_morphology_id icdo_morphology_label
#> 1700          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#> 1701          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#> 1702          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#> 1703          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#> 1704          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#> 1705          Lung Adenocarcinoma    pgx:icdom-81403   Adenocarcinoma, NOS
#>      icdo_topography_id icdo_topography_label pathological_stage_id
#> 1700    pgx:icdot-C34.9             Lung, NOS                      
#> 1701    pgx:icdot-C34.9             Lung, NOS           NCIT:C92207
#> 1702    pgx:icdot-C34.9             Lung, NOS           NCIT:C92207
#> 1703    pgx:icdot-C34.9             Lung, NOS           NCIT:C92207
#> 1704    pgx:icdot-C34.9             Lung, NOS           NCIT:C92207
#> 1705    pgx:icdot-C34.9             Lung, NOS           NCIT:C92207
#>      pathological_stage_label biosample_status_id  biosample_status_label
#> 1700                                  EFO:0030035 cancer cell line sample
#> 1701            Stage Unknown         EFO:0009656       neoplastic sample
#> 1702            Stage Unknown         EFO:0009656       neoplastic sample
#> 1703            Stage Unknown         EFO:0009656       neoplastic sample
#> 1704            Stage Unknown         EFO:0009656       neoplastic sample
#> 1705            Stage Unknown         EFO:0009656       neoplastic sample
#>      sampled_tissue_id sampled_tissue_label tnm stage grade age_iso sex_id
#> 1700    UBERON:0002048                 lung  NA    NA    NA             NA
#> 1701    UBERON:0002048                 lung  NA    NA    NA    P38Y     NA
#> 1702    UBERON:0002048                 lung  NA    NA    NA    P69Y     NA
#> 1703    UBERON:0002048                 lung  NA    NA    NA    P69Y     NA
#> 1704    UBERON:0002048                 lung  NA    NA    NA    P69Y     NA
#> 1705    UBERON:0002048                 lung  NA    NA    NA             NA
#>      sex_label followup_state_id followup_state_label followup_time
#> 1700        NA       EFO:0030039   no followup status            NA
#> 1701        NA       EFO:0030039   no followup status            NA
#> 1702        NA       EFO:0030039   no followup status            NA
#> 1703        NA       EFO:0030039   no followup status            NA
#> 1704        NA       EFO:0030039   no followup status            NA
#> 1705        NA       EFO:0030039   no followup status            NA
#>       geoprov_city          geoprov_country geoprov_iso_alpha3 geoprov_long_lat
#> 1700   Charlestown United States of America                USA    -71.06::42.38
#> 1701 New York City United States of America                USA    -74.01::40.71
#> 1702 New York City United States of America                USA    -74.01::40.71
#> 1703 New York City United States of America                USA    -74.01::40.71
#> 1704 New York City United States of America                USA    -74.01::40.71
#> 1705          Evry                   France                FRA      2.45::48.63
#>      cnv_fraction cnv_del_fraction cnv_dup_fraction cell_line
#> 1700           NA               NA               NA          
#> 1701           NA               NA               NA          
#> 1702           NA               NA               NA          
#> 1703           NA               NA               NA          
#> 1704           NA               NA               NA          
#> 1705           NA               NA               NA

The data contains many columns representing different aspects of sample information.
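
Some columns pack more than one value into a single string. As a minimal sketch, the geoprov_long_lat values can be split into numeric coordinates with base R. The strings below are copied from the output above. The order (longitude first, then latitude) is an assumption based on the column name and the listed cities.

```r
# Values copied from the geoprov_long_lat column shown above
# (Charlestown, New York City, Evry).
long_lat <- c("-71.06::42.38", "-74.01::40.71", "2.45::48.63")

# Split each string on the literal "::" separator.
parts <- strsplit(long_lat, "::", fixed = TRUE)

# Assumed order: longitude first, latitude second.
coords <- data.frame(
  longitude = as.numeric(vapply(parts, `[`, character(1), 1)),
  latitude  = as.numeric(vapply(parts, `[`, character(1), 2))
)
coords
```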

2.3 Search by biosample id and individual id

In Progenetix, biosample IDs and individual IDs are unique identifiers for biosamples and their corresponding individuals. You can obtain these IDs through a metadata search with filters, as described above, or by querying the website interface.

biosamples_2 <- pgxLoader(type="biosample", biosample_id = "pgxbs-kftvgioe",individual_id = "pgxind-kftx28q5")

metainfo <- c("biosample_id","individual_id","pubmed_id","followup_state_label","followup_time")
biosamples_2[metainfo]
#>     biosample_id   individual_id     pubmed_id     followup_state_label
#> 1 pgxbs-kftvgioe pgxind-kftx28pu PMID:24174329 alive (follow-up status)
#> 2 pgxbs-kftvgiom pgxind-kftx28q5 PMID:24174329  dead (follow-up status)
#>   followup_time
#> 1            NA
#> 2            NA
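
In the output above, each row matches only one of the two requested identifiers. This suggests that biosample_id and individual_id are combined as a union rather than an intersection. That reading is inferred from the printed output, not from a documented guarantee. The sketch below rebuilds the two rows by hand and flags which identifier each one matches.

```r
# The two rows returned above, rebuilt by hand from the printed output.
result <- data.frame(
  biosample_id  = c("pgxbs-kftvgioe", "pgxbs-kftvgiom"),
  individual_id = c("pgxind-kftx28pu", "pgxind-kftx28q5")
)

# Flag which of the requested identifiers each row matches.
result$by_biosample  <- result$biosample_id  == "pgxbs-kftvgioe"
result$by_individual <- result$individual_id == "pgxind-kftx28q5"
result
```

Row 1 is matched by the biosample ID only, and row 2 by the individual ID only.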

It’s also possible to query by a combination of filters, biosample id, and individual id.

2.4 Access a subset of samples

By default, pgxLoader returns all matching samples (limit = 0). You can retrieve a subset of them with the limit and skip parameters. For example, to access the first 1000 samples, set limit = 1000 and skip = 0.

biosamples_3 <- pgxLoader(type="biosample", filters = "NCIT:C3512",skip=0, limit = 1000)
# Dimension: Number of samples * features
print(dim(biosamples))
#> [1] 4641   49
print(dim(biosamples_3))
#> [1] 1000   49

2.5 Query the number of samples in Progenetix

The number of samples in a specific group can be queried with the pgxCount function.

pgxCount(filters = "NCIT:C3512")
#>      filters               label total_count exact_match_count
#> 1 NCIT:C3512 Lung Adenocarcinoma        4641              4505
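
The two counts can be compared directly. The sketch below copies the values from the output above. It assumes that total_count includes samples annotated with child terms of the filter while exact_match_count covers only samples annotated with the filter code itself, which is how the column names read.

```r
# Counts copied from the pgxCount output above.
counts <- data.frame(
  filters = "NCIT:C3512",
  total_count = 4641,
  exact_match_count = 4505
)

# Assumption: the difference is the number of samples annotated with a
# child term of the filter rather than the filter code itself.
counts$child_term_count <- counts$total_count - counts$exact_match_count
counts$child_term_count
# 136
```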

2.6 Parameter codematches use

The NCIt codes of the retrieved samples include not only the specified filter but also its child terms.

unique(biosamples$histological_diagnosis_id)
#> [1] "NCIT:C3512" "NCIT:C5649" "NCIT:C5650" "NCIT:C7270" "NCIT:C2923"
#> [6] "NCIT:C7269" "NCIT:C7268"

Setting codematches to TRUE makes the function return only biosamples that match the filter exactly.

biosamples_4 <- pgxLoader(type="biosample", filters = "NCIT:C3512",codematches = TRUE)

unique(biosamples_4$histological_diagnosis_id)
#> [1] "NCIT:C3512"

2.7 Parameter filterLogic use

This function supports querying samples that match multiple filters. For example, if you want to retrieve information about lung adenocarcinoma samples from the publication PMID:24174329, you can specify both filters and set filterLogic to “AND”.

biosamples_5 <- pgxLoader(type="biosample", filters = c("NCIT:C3512","PMID:24174329"), 
                          filterLogic = "AND")
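
Conceptually, “AND” and “OR” correspond to the intersection and union of the sample sets matched by each filter. The sketch below illustrates this with made-up sample IDs. The IDs are hypothetical, and the real matching is done by the Progenetix service, not by this code.

```r
# Hypothetical sample IDs matched by each filter (made up for illustration).
matches_ncit <- c("s1", "s2", "s3")
matches_pmid <- c("s2", "s3", "s4")

# filterLogic = "AND": samples matching every filter.
and_result <- intersect(matches_ncit, matches_pmid)
and_result
# "s2" "s3"

# filterLogic = "OR": samples matching at least one filter.
or_result <- union(matches_ncit, matches_pmid)
or_result
# "s1" "s2" "s3" "s4"
```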

3 Retrieve metadata of individuals

To query metadata (e.g. survival data) of the individuals from whom the samples of interest were taken, follow the steps below.

3.1 Relevant parameters

type, filters, filterLogic, individual_id, biosample_id, codematches, limit, skip, dataset

3.2 Search by filters

individuals <- pgxLoader(type="individual",filters="NCIT:C3270")
# Dimension: Number of individuals * features
print(dim(individuals))
#> [1] 2001   25
# data looks like this
individuals[c(36:40),]
#>      individual_id individual_legacy_id
#> 36 pgxind-kftx49h3                   NA
#> 37 pgxind-kftx7kn0                   NA
#> 38 pgxind-kftx27yo                   NA
#> 39 pgxind-kftx49ik                   NA
#> 40 pgxind-kftx3xon                   NA
#>                                         legacy_ids       sex_id
#> 36                               PGX_IND_GSM135032 PATO:0020000
#> 37 PGX_IND_NBL_TARGET_2018_PUB-TARGET_30_PAKADI_01 PATO:0020002
#> 38                       PGX_IND_9006325_NB-pla-11 PATO:0020000
#> 39                               PGX_IND_GSM135069 PATO:0020000
#> 40                               PGX_IND_GSM313917 PATO:0020000
#>               sex_label age_iso age_days data_use_conditions_id
#> 36        genotypic sex               NA                     NA
#> 37 female genotypic sex     P4Y 1460.970                     NA
#> 38        genotypic sex   P1Y5M  517.326                     NA
#> 39        genotypic sex               NA                     NA
#> 40        genotypic sex  P10Y4M 3774.092                     NA
#>    data_use_conditions_label histological_diagnosis_id
#> 36                        NA                NCIT:C3270
#> 37                        NA                NCIT:C3270
#> 38                        NA                NCIT:C3270
#> 39                        NA                NCIT:C3270
#> 40                        NA                NCIT:C3270
#>    histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 36                Neuroblastoma                  NA                        None
#> 37                Neuroblastoma                  NA                        None
#> 38                Neuroblastoma                  NA                        P28M
#> 39                Neuroblastoma                  NA                        None
#> 40                Neuroblastoma                  NA                        P42M
#>    index_disease_followup_state_id index_disease_followup_state_label
#> 36                     EFO:0030039                 no followup status
#> 37                     EFO:0030039                 no followup status
#> 38                     EFO:0030041           alive (follow-up status)
#> 39                     EFO:0030039                 no followup status
#> 40                     EFO:0030039                 no followup status
#>    auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 36                   NA                      NA                      NA
#> 37                   NA                      NA                      NA
#> 38                   NA                      NA                      NA
#> 39                   NA                      NA                      NA
#> 40                   NA                      NA                      NA
#>    geoprov_id  geoprov_city          geoprov_country geoprov_iso_alpha3
#> 36         NA         Chiba                    Japan                 NA
#> 37         NA      Bethesda United States of America                 NA
#> 38         NA San Francisco            United States                 NA
#> 39         NA         Chiba                    Japan                 NA
#> 40         NA         Tokyo                    Japan                 NA
#>    geoprov_long_lat cell_line_donation_id cell_line_donation_label
#> 36     140.12::35.6                    NA                       NA
#> 37     -77.1::38.98                    NA                       NA
#> 38   -122.42::37.77                    NA                       NA
#> 39     140.12::35.6                    NA                       NA
#> 40    139.69::35.69                    NA                       NA
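
The age_iso and index_disease_followup_time columns hold ISO 8601 durations such as P1Y5M or P28M. The following sketch converts durations made of year and month parts to days. iso_to_days is a hypothetical helper, not a pgxRpi function. The day lengths (365.2425 days per year, 365 / 12 days per month) are assumptions chosen because they reproduce the age_days values shown above. Day, week, and time components are not handled.

```r
# Hypothetical helper (not part of pgxRpi): convert ISO 8601 durations with
# year and month parts to days. Assumed lengths: 365.2425 days per year and
# 365 / 12 days per month.
iso_to_days <- function(x) {
  get_part <- function(unit) {
    hits <- regmatches(x, regexec(paste0("([0-9]+)", unit), x))
    vapply(hits, function(h) if (length(h) == 2) as.numeric(h[2]) else 0,
           numeric(1))
  }
  days <- get_part("Y") * 365.2425 + get_part("M") * (365 / 12)
  days[!grepl("^P", x)] <- NA  # blank or malformed entries become NA
  days
}

# Values copied from the age_iso column above, plus one blank entry.
round(iso_to_days(c("P4Y", "P1Y5M", "P10Y4M", "")), 3)
# 1460.970 517.326 3774.092 NA
```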

3.3 Search by biosample id and individual id

You can obtain these IDs from the sample metadata queries described above.

individual <- pgxLoader(type="individual",individual_id = "pgxind-kftx26ml", biosample_id="pgxbs-kftvh94d")

individual
#>     individual_id individual_legacy_id            legacy_ids       sex_id
#> 1 pgxind-kftx3565                   NA     PGX_IND_EpTu-N270 PATO:0020000
#> 2 pgxind-kftx26ml                   NA PGX_IND_AdSqLu-bjo-01 PATO:0020001
#>            sex_label age_iso age_days data_use_conditions_id
#> 1      genotypic sex      NA       NA                     NA
#> 2 male genotypic sex      NA       NA                     NA
#>   data_use_conditions_label histological_diagnosis_id
#> 1                        NA                NCIT:C3697
#> 2                        NA                NCIT:C3493
#>   histological_diagnosis_label index_disease_notes index_disease_followup_time
#> 1     Myxopapillary Ependymoma                  NA                        None
#> 2 Squamous Cell Lung Carcinoma                  NA                        None
#>   index_disease_followup_state_id index_disease_followup_state_label
#> 1                     EFO:0030039                 no followup status
#> 2                     EFO:0030039                 no followup status
#>   auxiliary_disease_id auxiliary_disease_label auxiliary_disease_notes
#> 1                   NA                      NA                      NA
#> 2                   NA                      NA                      NA
#>   geoprov_id geoprov_city geoprov_country geoprov_iso_alpha3 geoprov_long_lat
#> 1         NA     Nijmegen The Netherlands                 NA      5.84::51.81
#> 2         NA     Helsinki         Finland                 NA     24.94::60.17
#>   cell_line_donation_id cell_line_donation_label
#> 1                    NA                       NA
#> 2                    NA                       NA

4 Visualization of survival data

4.1 pgxMetaplot function

This function generates a survival plot using metadata of individuals obtained by the pgxLoader function.

The parameters of this function:

  • data: The metadata of individuals returned by the pgxLoader function.
  • group_id: A string specifying which column is used for grouping in the Kaplan-Meier plot.
  • condition: Condition for splitting individuals into younger and older groups. Only used if group_id is age related.
  • return_data: A logical value determining whether to return the metadata used for plotting. Default is FALSE.
  • ...: Other parameters relevant to the KM plot. These include pval, pval.coord, pval.method, conf.int, linetype, and palette (see the ggsurvplot function from the survminer package).

Suppose you want to investigate whether survival differs between younger and older patients with a particular disease. You can query and visualize the relevant information as follows:

# query metadata of individuals with lung adenocarcinoma
luad_inds <- pgxLoader(type="individual",filters="NCIT:C3512")
# use 65 years old as the splitting condition
pgxMetaplot(data=luad_inds, group_id="age_iso", condition="P65Y", pval=TRUE)

Note that not all individuals have survival data available. If you set return_data to TRUE, the function returns the metadata of the individuals used for the plot.
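
To make the age split concrete, the sketch below assigns individuals to younger and older groups around a P65Y threshold using base R. It is an illustration only and not necessarily how pgxMetaplot works internally. The ages are copied from age_iso values shown earlier in this vignette. Treating ages at or above the threshold as "older" is an assumption, since the boundary handling is not stated here.

```r
# Ages copied from age_iso values shown earlier in this vignette.
age_iso <- c("P38Y", "P69Y", "P4Y", "P10Y4M")

# Keep only the whole-year part. Ignoring months is a simplification that
# can matter for ages close to the threshold.
years_of <- function(x) as.numeric(sub("^P([0-9]+)Y.*$", "\\1", x))
age_years <- years_of(age_iso)
threshold <- years_of("P65Y")

# Assumption: individuals at or above the threshold count as "older".
age_group <- ifelse(age_years >= threshold, "older", "younger")
table(age_group)
```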

5 Session Info

#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgxRpi_1.0.2     BiocStyle_2.32.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5        xfun_0.44           bslib_0.7.0        
#>  [4] ggplot2_3.5.1       rstatix_0.7.2       lattice_0.22-6     
#>  [7] vctrs_0.6.5         tools_4.4.0         generics_0.1.3     
#> [10] curl_5.2.1          tibble_3.2.1        fansi_1.0.6        
#> [13] highr_0.11          pkgconfig_2.0.3     Matrix_1.7-0       
#> [16] data.table_1.15.4   lifecycle_1.0.4     compiler_4.4.0     
#> [19] farver_2.1.2        munsell_0.5.1       tinytex_0.51       
#> [22] carData_3.0-5       htmltools_0.5.8.1   sass_0.4.9         
#> [25] yaml_2.3.8          pillar_1.9.0        car_3.1-2          
#> [28] ggpubr_0.6.0        jquerylib_0.1.4     tidyr_1.3.1        
#> [31] cachem_1.1.0        survminer_0.4.9     magick_2.8.3       
#> [34] abind_1.4-5         km.ci_0.5-6         tidyselect_1.2.1   
#> [37] digest_0.6.35       dplyr_1.1.4         purrr_1.0.2        
#> [40] bookdown_0.39       labeling_0.4.3      splines_4.4.0      
#> [43] fastmap_1.2.0       grid_4.4.0          colorspace_2.1-0   
#> [46] cli_3.6.2           magrittr_2.0.3      survival_3.7-0     
#> [49] utf8_1.2.4          broom_1.0.6         withr_3.0.0        
#> [52] scales_1.3.0        backports_1.5.0     lubridate_1.9.3    
#> [55] timechange_0.3.0    rmarkdown_2.27      httr_1.4.7         
#> [58] gridExtra_2.3       ggsignif_0.6.4      zoo_1.8-12         
#> [61] evaluate_0.24.0     knitr_1.47          KMsurv_0.1-5       
#> [64] survMisc_0.5.6      rlang_1.1.4         Rcpp_1.0.12        
#> [67] xtable_1.8-4        glue_1.7.0          BiocManager_1.30.23
#> [70] attempt_0.3.1       jsonlite_1.8.8      R6_2.5.1           
#> [73] plyr_1.8.9