SomaticCancerAlterations
1 Motivation
Over the last years, major efforts have been made to characterize the somatic landscape of cancers. Many of these studies make their results publicly available, providing a valuable resource for investigations beyond the level of individual cohorts. The SomaticCancerAlterations package collects mutational data of several tumor types, currently focusing on the TCGA call sets, and aims for a tight integration with R and Bioconductor workflows. In the following, we illustrate how to access this data and give examples of use cases.
2 Data Sets
The Cancer Genome Atlas (TCGA)1 is a consortium effort to analyze a variety of tumor types, including gene expression, methylation, copy number changes, and somatic mutations2. With the SomaticCancerAlterations package, we provide the call sets of somatic mutations for all publicly available TCGA studies. Over time, more studies will be added as they become available and unrestricted in their usage.
To get started, we get a list of all available data sets and access the metadata associated with each study.
library(SomaticCancerAlterations)
library(GenomicRanges)
all_datasets = scaListDatasets()
print(all_datasets)
## [1] "gbm_tcga"  "hnsc_tcga" "kirc_tcga" "luad_tcga" "lusc_tcga" "ov_tcga"
## [7] "skcm_tcga" "thca_tcga"
meta_data = scaMetadata()
print(meta_data)
##           Cancer_Type       Center NCBI_Build Sequence_Source Sequencing_Phase
## gbm_tcga          GBM broad.mi....         37             WXS          Phase_I
## hnsc_tcga        HNSC broad.mi....         37         Capture          Phase_I
## kirc_tcga        KIRC broad.mi....         37         Capture          Phase_I
## luad_tcga        LUAD broad.mi....         37             WXS          Phase_I
## lusc_tcga        LUSC broad.mi....         37             WXS          Phase_I
## ov_tcga            OV broad.mi....         37             WXS          Phase_I
## skcm_tcga        SKCM broad.mi....         37         Capture          Phase_I
## thca_tcga        THCA broad.mi....         37             WXS          Phase_I
##              Sequencer Number_Samples Number_Patients
## gbm_tcga  Illumina....            291             291
## hnsc_tcga Illumina....            319             319
## kirc_tcga Illumina....            297             293
## luad_tcga Illumina....            538             519
## lusc_tcga Illumina....            178             178
## ov_tcga   Illumina....            142             142
## skcm_tcga Illumina....            266             264
## thca_tcga Illumina....            406             403
##                                     Cancer_Name
## gbm_tcga                Glioblastoma multiforme
## hnsc_tcga Head and Neck squamous cell carcinoma
## kirc_tcga                    Kidney Chromophobe
## luad_tcga                   Lung adenocarcinoma
## lusc_tcga          Lung squamous cell carcinoma
## ov_tcga       Ovarian serous cystadenocarcinoma
## skcm_tcga               Skin Cutaneous Melanoma
## thca_tcga                     Thyroid carcinoma
Next, we load a single dataset with the scaLoadDatasets function.
ov = scaLoadDatasets("ov_tcga", merge = TRUE)
3 Exploring Mutational Data
The somatic variants of each study are represented as a GRanges object, ordered by genomic position. Additional columns describe properties of the variant and relate it to the affected gene, sample, and patient.
head(ov, 3)
## GRanges object with 3 ranges and 14 metadata columns:
##           seqnames    ranges strand | Hugo_Symbol Entrez_Gene_Id
##              <Rle> <IRanges>  <Rle> |    <factor>      <integer>
##   ov_tcga        1   1334552      * |       CCNL2          81669
##   ov_tcga        1   1961652      * |       GABRD           2563
##   ov_tcga        1   2420688      * |       PLCH2           9651
##           Variant_Classification Variant_Type Reference_Allele
##                         <factor>     <factor>         <factor>
##   ov_tcga                 Silent          SNP                C
##   ov_tcga                 Silent          SNP                C
##   ov_tcga      Missense_Mutation          SNP                C
##           Tumor_Seq_Allele1 Tumor_Seq_Allele2 Verification_Status
##                    <factor>          <factor>            <factor>
##   ov_tcga                 C                 T             Unknown
##   ov_tcga                 C                 T             Unknown
##   ov_tcga                 C                 G             Unknown
##           Validation_Status Mutation_Status   Patient_ID
##                    <factor>        <factor>     <factor>
##   ov_tcga             Valid         Somatic TCGA-24-2262
##   ov_tcga             Valid         Somatic TCGA-24-1552
##   ov_tcga             Valid         Somatic TCGA-13-1484
##                              Sample_ID     index  Dataset
##                               <factor> <integer> <factor>
##   ov_tcga TCGA-24-2262-01A-01W-0799-08      3901  ov_tcga
##   ov_tcga TCGA-24-1552-01A-01W-0551-08      3414  ov_tcga
##   ov_tcga TCGA-13-1484-01A-01W-0545-08      1567  ov_tcga
##   -------
##   seqinfo: 86 sequences from an unspecified genome
with(mcols(ov), table(Variant_Classification, Variant_Type))
##                         Variant_Type
## Variant_Classification    DEL  INS  SNP
##   3'UTR                     0    0    3
##   5'Flank                   0    0    1
##   5'UTR                     0    0    1
##   Frame_Shift_Del          79    0    0
##   Frame_Shift_Ins           0   16    0
##   IGR                       0    0    5
##   In_Frame_Del             26    0    0
##   In_Frame_Ins              0    1    0
##   Intron                    0    0   34
##   Missense_Mutation         0    0 4299
##   Nonsense_Mutation         0    0  285
##   Nonstop_Mutation          0    0    6
##   RNA                       0    0    1
##   Silent                    0    0 1417
##   Splice_Site               9    2  121
##   Translation_Start_Site    1    0    1
With such data at hand, we can identify the samples and genes harboring the most mutations.
head(sort(table(ov$Sample_ID), decreasing = TRUE))
##
## TCGA-09-2049-01D-01W-0799-08 TCGA-13-0923-01A-01W-0420-08
##                          119                          118
## TCGA-09-2050-01A-01W-0799-08 TCGA-25-1326-01A-01W-0492-08
##                          111                          110
## TCGA-25-1313-01A-01W-0492-08 TCGA-23-1110-01A-01D-0428-08
##                          104                          102
head(sort(table(ov$Hugo_Symbol), decreasing = TRUE), 10)
##
##    TP53     TTN PCDHAC2   MUC16   MUC17 PCDHGC5   USH2A   CSMD3 CD163L1 DYNC1H1
##     118      30      14      12       9       9       9       8       7       7
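The tables above count variants, so a patient carrying several variants in the same gene is counted more than once. A complementary view counts distinct patients per gene. The following base R sketch shows the difference on a small made-up data frame that stands in for `mcols(ov)`; the values and the patient labels `P1` to `P3` are invented for illustration and are not taken from the package data.

```r
# Toy stand-in for mcols(ov); all values are made up for illustration.
variants = data.frame(
  Hugo_Symbol = c("TP53", "TP53", "TTN", "TTN", "TTN", "MUC16"),
  Patient_ID  = c("P1", "P2", "P1", "P1", "P3", "P2"),
  stringsAsFactors = FALSE
)

# Variants per gene: the kind of count table(ov$Hugo_Symbol) reports.
variants_per_gene = sort(table(variants$Hugo_Symbol), decreasing = TRUE)

# Distinct patients per gene: drop repeated gene/patient pairs first.
unique_pairs = unique(variants[, c("Hugo_Symbol", "Patient_ID")])
patients_per_gene = sort(table(unique_pairs$Hugo_Symbol), decreasing = TRUE)

print(variants_per_gene)
print(patients_per_gene)
```

In this toy example TTN has three variants but only two affected patients, because one patient carries two of them.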
4 Exploring Multiple Studies
Instead of focusing on an individual study, we can also import several at once. The results are stored as a GRangesList in which each element corresponds to a single study. This can be merged into a single GRanges object with merge = TRUE.
three_studies = scaLoadDatasets(all_datasets[1:3])
print(elementNROWS(three_studies))
##  gbm_tcga hnsc_tcga kirc_tcga
##     22166     73766     26265
class(three_studies)
## [1] "SimpleGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
merged_studies = scaLoadDatasets(all_datasets[1:3], merge = TRUE)
class(merged_studies)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
We then compute the number of mutations per gene and study:
gene_study_count = with(mcols(merged_studies), table(Hugo_Symbol, Dataset))
gene_study_count = gene_study_count[order(apply(gene_study_count, 1, sum), decreasing = TRUE), ]
gene_study_count = addmargins(gene_study_count)
head(gene_study_count)
##            Dataset
## Hugo_Symbol gbm_tcga hnsc_tcga kirc_tcga  Sum
##     Unknown       29       899       630 1558
##     TTN          121       401       125  647
##     TP53         101       323         8  432
##     MUC16         68       155        46  269
##     ADAM6          0       173        63  236
##     MUC4          17        32       130  179
Further, we can subset the data by regions of interest and compute descriptive statistics on the subset only.
tp53_region = GRanges("17", IRanges(7571720, 7590863))
tp53_studies = subsetByOverlaps(merged_studies, tp53_region)
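The overlap test behind this kind of subsetting can be illustrated without any Bioconductor dependency. The following base R sketch uses a made-up table of variant coordinates and closed intervals; the column names `chrom`, `start`, and `end` are assumptions for illustration. It only mirrors the idea of keeping ranges that overlap a region and is not how `subsetByOverlaps` is implemented.

```r
# Toy variant coordinates; all values are invented for illustration.
variants = data.frame(
  chrom = c("17", "17", "17", "1"),
  start = c(7571000, 7577000, 7590863, 7577000),
  end   = c(7571719, 7577001, 7590900, 7577000),
  stringsAsFactors = FALSE
)

# Region of interest, using the TP53 coordinates from the text.
region = list(chrom = "17", start = 7571720, end = 7590863)

# Two closed intervals overlap if each one starts before the other ends.
overlaps = variants$chrom == region$chrom &
  variants$start <= region$end &
  variants$end >= region$start

variants_in_region = variants[overlaps, ]
print(variants_in_region)
```

The first variant ends one base before the region starts and the last one lies on a different chromosome, so only the two middle rows are kept.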
For example, we can investigate which types of somatic variants are found in TP53 across the studies.
addmargins(table(tp53_studies$Variant_Classification, tp53_studies$Dataset))
##
##                          gbm_tcga hnsc_tcga kirc_tcga Sum
##   Frame_Shift_Del               6        41         0  47
##   Frame_Shift_Ins               1        11         0  12
##   In_Frame_Del                  2         7         0   9
##   In_Frame_Ins                  0         2         0   2
##   Missense_Mutation            81       183         6 270
##   Nonsense_Mutation             4        54         0  58
##   Nonstop_Mutation              0         0         0   0
##   Silent                        1         6         1   8
##   Splice_Site                   6        19         1  26
##   Translation_Start_Site        0         0         0   0
##   RNA                           0         0         0   0
##   Sum                         101       323         8 432
To go further, how many patients have mutations in TP53 for each cancer type?
fraction_mutated_region = function(y, region) {
  s = subsetByOverlaps(y, region)
  m = length(unique(s$Patient_ID)) / metadata(s)$Number_Patients
  return(m)
}

mutated_fraction = sapply(three_studies, fraction_mutated_region, tp53_region)
mutated_fraction = data.frame(name = names(three_studies), fraction = mutated_fraction)
library(ggplot2)

p = ggplot(mutated_fraction) +
  ggplot2::geom_bar(aes(x = name, y = fraction, fill = name), stat = "identity") +
  ylim(0, 1) +
  xlab("Study") +
  ylab("Ratio") +
  theme_bw()

print(p)
5 Data Provenance
5.1 TCGA Data
When importing the mutation data from the TCGA servers, we checked the data for consistency and fixed common ambiguities in the annotation.
5.1.1 Processing
- Selection of the most recent somatic variant calls for each study. These were stored as *.maf files in the TCGA data directory3. If both manually curated and automatically generated variant calls were available, the curated version was chosen.
- Importing of the *.maf files into R and checking for consistency with the TCGA MAF specifications4. Please note that these guidelines are currently only suggestions and most TCGA files violate some of them.
- Transformation of the imported variants into a GRanges object, with one row for each reported variant. Only columns related to the genomic origin of the somatic variant were stored; additional columns describing higher-level effects, such as mutational consequences and alterations at the protein level, were dropped. The seqlevels information defining the chromosomal ranges was taken from the 1000genomes phase 2 reference assembly5.
- The patient barcode was extracted from the sample barcode.
- Metadata describing the design and analysis of the study was extracted.
- The processed variants were written to disk, with one file for each study. The metadata for all studies were stored as a single, separate object.
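The barcode step in the list above can be sketched in base R. In the `ov_tcga` output shown earlier, each Patient_ID equals the first three dash-separated fields of the corresponding Sample_ID, so this sketch assumes that rule. The helper `extract_patient_id` is a hypothetical name for illustration and not a function of the package.

```r
# Sample barcodes taken from the ov_tcga output shown earlier.
sample_ids = c(
  "TCGA-24-2262-01A-01W-0799-08",
  "TCGA-24-1552-01A-01W-0551-08",
  "TCGA-13-1484-01A-01W-0545-08"
)

# Hypothetical helper (not part of the package).
# Assumption: the patient barcode is the first three dash-separated fields.
extract_patient_id = function(sample_id) {
  sub("^([^-]+-[^-]+-[^-]+)-.*$", "\\1", sample_id)
}

patient_ids = extract_patient_id(sample_ids)
print(patient_ids)
```

Input that has fewer than four fields does not match the pattern and is returned unchanged.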
5.1.2 Selection Criteria of Data Sets
We included data sets in the package that were
- conducted by the Broad Institute.
- cleared for unrestricted access and usage6.
- sequenced with Illumina platforms.
5.1.3 Consistency Check
According to the TCGA specifications for the MAF files, we screened and corrected for common artifacts in the data regarding annotation. This included:
- Transferring of all genomic coordinates to the NCBI 37 reference notation (with the mitochondrial chromosome always depicted as 'MT')
- Checking of the entries against all allowed values for this field (currently for the columns Hugo_Symbol, Chromosome, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, Verification_Status, Validation_Status, Sequencer).
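As a minimal sketch of what such checks could look like, the following base R code normalizes chromosome names and flags entries outside a set of allowed values. The helpers `normalize_chromosome` and `find_invalid_entries`, the stripping of a 'chr' prefix, the toy data, and the allowed values for Variant_Type are all assumptions for illustration. They do not reproduce the package's actual import code or the full MAF specification.

```r
# Hypothetical helper: drop a "chr" prefix (an assumed input convention)
# and write the mitochondrial chromosome as "MT".
normalize_chromosome = function(chrom) {
  chrom = sub("^chr", "", chrom, ignore.case = TRUE)
  chrom[toupper(chrom) %in% c("M", "MT")] = "MT"
  chrom
}

# Hypothetical helper: for each checked column, return the row indices
# whose entries are not among the allowed values.
find_invalid_entries = function(maf, allowed) {
  invalid = lapply(names(allowed), function(column) {
    which(!(maf[[column]] %in% allowed[[column]]))
  })
  names(invalid) = names(allowed)
  invalid
}

# Illustrative subset of allowed values, not the full specification.
allowed_values = list(
  Variant_Type = c("SNP", "DNP", "TNP", "ONP", "INS", "DEL")
)

# Toy MAF-like table; all values are made up.
maf = data.frame(
  Chromosome   = c("chr1", "17", "chrM", "X"),
  Variant_Type = c("SNP", "snp", "DEL", "INS"),
  stringsAsFactors = FALSE
)

maf$Chromosome = normalize_chromosome(maf$Chromosome)
invalid = find_invalid_entries(maf, allowed_values)
print(maf$Chromosome)
print(invalid)
```

Here the second row is flagged because the lowercase `snp` is not among the allowed values; whether such entries are corrected or rejected is a separate design decision.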
6 Alternatives
The TCGA data sets can be accessed in different ways. First, the TCGA itself offers access to certain types of its collected data7. Another approach has been taken by the cBioPortal for Cancer Genomics8 which has performed high-level analyses of several TCGA data sources, such as gene expression and copy number changes. This summarized data can be queried through an interface9.
7 Session Info
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] ggbio_1.53.0                    ggplot2_3.5.1
## [3] GenomicRanges_1.57.2            GenomeInfoDb_1.41.2
## [5] IRanges_2.39.2                  S4Vectors_0.43.2
## [7] BiocGenerics_0.51.3             SomaticCancerAlterations_1.41.0
##
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1
##   [3] jsonlite_1.8.9              magrittr_2.0.3
##   [5] GenomicFeatures_1.57.1      farver_2.1.2
##   [7] rmarkdown_2.28              BiocIO_1.15.2
##   [9] zlibbioc_1.51.2             vctrs_0.6.5
##  [11] memoise_2.0.1               Rsamtools_2.21.2
##  [13] RCurl_1.98-1.16             base64enc_0.1-3
##  [15] htmltools_0.5.8.1           S4Arrays_1.5.11
##  [17] progress_1.2.3              curl_5.2.3
##  [19] SparseArray_1.5.45          Formula_1.2-5
##  [21] htmlwidgets_1.6.4           plyr_1.8.9
##  [23] httr2_1.0.5                 cachem_1.1.0
##  [25] GenomicAlignments_1.41.0    lifecycle_1.0.4
##  [27] pkgconfig_2.0.3             Matrix_1.7-1
##  [29] R6_2.5.1                    fastmap_1.2.0
##  [31] GenomeInfoDbData_1.2.13     MatrixGenerics_1.17.1
##  [33] digest_0.6.37               colorspace_2.1-1
##  [35] GGally_2.2.1                AnnotationDbi_1.67.0
##  [37] OrganismDbi_1.47.0          Hmisc_5.1-3
##  [39] RSQLite_2.3.7               labeling_0.4.3
##  [41] filelock_1.0.3              fansi_1.0.6
##  [43] httr_1.4.7                  abind_1.4-8
##  [45] compiler_4.5.0              bit64_4.5.2
##  [47] withr_3.0.1                 htmlTable_2.4.3
##  [49] backports_1.5.0             BiocParallel_1.39.0
##  [51] DBI_1.2.3                   ggstats_0.7.0
##  [53] highr_0.11                  biomaRt_2.61.3
##  [55] rappdirs_0.3.3              DelayedArray_0.31.14
##  [57] rjson_0.2.23                tools_4.5.0
##  [59] foreign_0.8-87              nnet_7.3-19
##  [61] glue_1.8.0                  restfulr_0.0.15
##  [63] grid_4.5.0                  checkmate_2.3.2
##  [65] cluster_2.1.6               reshape2_1.4.4
##  [67] generics_0.1.3              gtable_0.3.6
##  [69] BSgenome_1.73.1             tidyr_1.3.1
##  [71] ensembldb_2.29.1            data.table_1.16.2
##  [73] hms_1.1.3                   xml2_1.3.6
##  [75] utf8_1.2.4                  XVector_0.45.0
##  [77] pillar_1.9.0                stringr_1.5.1
##  [79] dplyr_1.1.4                 BiocFileCache_2.13.2
##  [81] lattice_0.22-6              rtracklayer_1.65.0
##  [83] bit_4.5.0                   biovizBase_1.53.0
##  [85] RBGL_1.81.0                 tidyselect_1.2.1
##  [87] Biostrings_2.73.2           knitr_1.48
##  [89] gridExtra_2.3               ProtGenerics_1.37.1
##  [91] SummarizedExperiment_1.35.5 xfun_0.48
##  [93] Biobase_2.65.1              matrixStats_1.4.1
##  [95] stringi_1.8.4               UCSC.utils_1.1.0
##  [97] lazyeval_0.2.2              yaml_2.3.10
##  [99] evaluate_1.0.1              codetools_0.2-20
## [101] tibble_3.2.1                graph_1.83.0
## [103] BiocManager_1.30.25         cli_3.6.3
## [105] rpart_4.1.23                munsell_0.5.1
## [107] dichromat_2.0-0.1           Rcpp_1.0.13
## [109] dbplyr_2.5.0                png_0.1-8
## [111] XML_3.99-0.17               parallel_4.5.0
## [113] blob_1.2.4                  prettyunits_1.2.0
## [115] AnnotationFilter_1.29.0     bitops_1.0-9
## [117] txdbmaker_1.1.2             VariantAnnotation_1.51.2
## [119] scales_1.3.0                purrr_1.0.2
## [121] crayon_1.5.3                rlang_1.1.4
## [123] KEGGREST_1.45.1