SomaticCancerAlterations

1. Motivation
2. Data Sets
3. Exploring Mutational Data
4. Exploring Multiple Studies
5. Data Provenance
- 5.1. TCGA Data
6. Alternatives
7. Session Info

1 Motivation

Over the last years, large efforts have been taken to characterize the somatic landscape of cancers. Many of the conducted studies make their results publicly available, providing a valuable resource for investigating beyond the level of individual cohorts. The SomaticCancerAlterations package collects mutational data of several tumor types, currently focusing on the TCGA calls sets, and aims for a tight integration with and workflows. In the following, we will illustrate how to access this data and give examples for use cases.

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: generics

## 
## Attaching package: 'generics'

## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:generics':
## 
##     intersect, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
##     setequal, table, tapply, union, unique, unsplit, which.max,
##     which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## Need specific help about ggbio? try mailing 
##  the maintainer or visit https://lawremi.github.io/ggbio/

## 
## Attaching package: 'ggbio'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_bar, geom_rect, geom_segment, ggsave, stat_bin, stat_identity,
##     xlim

2 Data Sets

The Cancer Genome Atlas (TCGA)¹ is a consortium effort to analyze a variety of tumor types, including gene expression, methylation, copy number changes, and somatic mutations². With the SomaticCancerAlterations package, we provide the callsets of somatic mutations for all publically available TCGA studies. Over time, more studies will be added, as they become available and unrestriced in their usage.

To get started, we get a list of all available data sets and access the metadata associated with each study.

all_datasets = scaListDatasets()
print(all_datasets)

## [1] "gbm_tcga"  "hnsc_tcga" "kirc_tcga" "luad_tcga" "lusc_tcga" "ov_tcga"  
## [7] "skcm_tcga" "thca_tcga"

meta_data = scaMetadata()
print(meta_data)

##           Cancer_Type       Center NCBI_Build Sequence_Source Sequencing_Phase
## gbm_tcga          GBM broad.mi....         37             WXS          Phase_I
## hnsc_tcga        HNSC broad.mi....         37         Capture          Phase_I
## kirc_tcga        KIRC broad.mi....         37         Capture          Phase_I
## luad_tcga        LUAD broad.mi....         37             WXS          Phase_I
## lusc_tcga        LUSC broad.mi....         37             WXS          Phase_I
## ov_tcga            OV broad.mi....         37             WXS          Phase_I
## skcm_tcga        SKCM broad.mi....         37         Capture          Phase_I
## thca_tcga        THCA broad.mi....         37             WXS          Phase_I
##              Sequencer Number_Samples Number_Patients
## gbm_tcga  Illumina....            291             291
## hnsc_tcga Illumina....            319             319
## kirc_tcga Illumina....            297             293
## luad_tcga Illumina....            538             519
## lusc_tcga Illumina....            178             178
## ov_tcga   Illumina....            142             142
## skcm_tcga Illumina....            266             264
## thca_tcga Illumina....            406             403
##                                     Cancer_Name
## gbm_tcga                Glioblastoma multiforme
## hnsc_tcga Head and Neck squamous cell carcinoma
## kirc_tcga                    Kidney Chromophobe
## luad_tcga                   Lung adenocarcinoma
## lusc_tcga          Lung squamous cell carcinoma
## ov_tcga       Ovarian serous cystadenocarcinoma
## skcm_tcga               Skin Cutaneous Melanoma
## thca_tcga                    Thyroid carcinoma

Next, we load a single dataset with the scaLoadDataset function.

ov = scaLoadDatasets("ov_tcga", merge = TRUE)

3 Exploring Mutational Data

The somatic variants of each study are represented as a object, ordered by genomic positions. Additional columns describe properties of the variant and relate it the the affected gene, sample, and patient.

head(ov, 3)

## GRanges object with 3 ranges and 14 metadata columns:
##           seqnames    ranges strand | Hugo_Symbol Entrez_Gene_Id
##              <Rle> <IRanges>  <Rle> |    <factor>      <integer>
##   ov_tcga        1   1334552      * |       CCNL2          81669
##   ov_tcga        1   1961652      * |       GABRD           2563
##   ov_tcga        1   2420688      * |       PLCH2           9651
##           Variant_Classification Variant_Type Reference_Allele
##                         <factor>     <factor>         <factor>
##   ov_tcga      Silent                     SNP                C
##   ov_tcga      Silent                     SNP                C
##   ov_tcga      Missense_Mutation          SNP                C
##           Tumor_Seq_Allele1 Tumor_Seq_Allele2 Verification_Status
##                    <factor>          <factor>            <factor>
##   ov_tcga                 C                 T             Unknown
##   ov_tcga                 C                 T             Unknown
##   ov_tcga                 C                 G             Unknown
##           Validation_Status Mutation_Status   Patient_ID
##                    <factor>        <factor>     <factor>
##   ov_tcga             Valid         Somatic TCGA-24-2262
##   ov_tcga             Valid         Somatic TCGA-24-1552
##   ov_tcga             Valid         Somatic TCGA-13-1484
##                              Sample_ID     index  Dataset
##                               <factor> <integer> <factor>
##   ov_tcga TCGA-24-2262-01A-01W-0799-08      3901  ov_tcga
##   ov_tcga TCGA-24-1552-01A-01W-0551-08      3414  ov_tcga
##   ov_tcga TCGA-13-1484-01A-01W-0545-08      1567  ov_tcga
##   -------
##   seqinfo: 86 sequences from an unspecified genome

with(mcols(ov), table(Variant_Classification, Variant_Type))

##                         Variant_Type
## Variant_Classification    DEL  INS  SNP
##   3'UTR                     0    0    3
##   5'Flank                   0    0    1
##   5'UTR                     0    0    1
##   Frame_Shift_Del          79    0    0
##   Frame_Shift_Ins           0   16    0
##   IGR                       0    0    5
##   In_Frame_Del             26    0    0
##   In_Frame_Ins              0    1    0
##   Intron                    0    0   34
##   Missense_Mutation         0    0 4299
##   Nonsense_Mutation         0    0  285
##   Nonstop_Mutation          0    0    6
##   RNA                       0    0    1
##   Silent                    0    0 1417
##   Splice_Site               9    2  121
##   Translation_Start_Site    1    0    1

With such data at hand, we can identify the samples and genes haboring the most mutations.

head(sort(table(ov$Sample_ID), decreasing = TRUE))

## 
## TCGA-09-2049-01D-01W-0799-08 TCGA-13-0923-01A-01W-0420-08 
##                          119                          118 
## TCGA-09-2050-01A-01W-0799-08 TCGA-25-1326-01A-01W-0492-08 
##                          111                          110 
## TCGA-25-1313-01A-01W-0492-08 TCGA-23-1110-01A-01D-0428-08 
##                          104                          102

head(sort(table(ov$Hugo_Symbol), decreasing = TRUE), 10)

## 
##    TP53     TTN PCDHAC2   MUC16   MUC17 PCDHGC5   USH2A   CSMD3 CD163L1 DYNC1H1 
##     118      30      14      12       9       9       9       8       7       7

4 Exploring Multiple Studies

Instead of focusing on an individual study, we can also import several at once. The results are stored as a GRangesList in which each element corresponds to a single study. This can be merged into a single GRanges object with merge = TRUE.

three_studies = scaLoadDatasets(all_datasets[1:3])

print(elementNROWS(three_studies))

##  gbm_tcga hnsc_tcga kirc_tcga 
##     22166     73766     26265

class(three_studies)

## [1] "SimpleGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

merged_studies = scaLoadDatasets(all_datasets[1:3], merge = TRUE)

class(merged_studies)

## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

We then compute the number of mutations per gene and study:

gene_study_count = with(mcols(merged_studies), table(Hugo_Symbol, Dataset))

gene_study_count = gene_study_count[order(apply(gene_study_count, 1, sum), decreasing = TRUE), ]

gene_study_count = addmargins(gene_study_count)

head(gene_study_count)

##            Dataset
## Hugo_Symbol gbm_tcga hnsc_tcga kirc_tcga  Sum
##     Unknown       29       899       630 1558
##     TTN          121       401       125  647
##     TP53         101       323         8  432
##     MUC16         68       155        46  269
##     ADAM6          0       173        63  236
##     MUC4          17        32       130  179

Further, we can subset the data by regions of interests, and compute descriptive statistics only on the subset.

tp53_region = GRanges("17", IRanges(7571720, 7590863))

tp53_studies = subsetByOverlaps(merged_studies, tp53_region)

For example, we can investigate which type of somatic variants can be found in TP53 throughout the studies.

addmargins(table(tp53_studies$Variant_Classification, tp53_studies$Dataset))

##                         
##                          gbm_tcga hnsc_tcga kirc_tcga Sum
##   Frame_Shift_Del               6        41         0  47
##   Frame_Shift_Ins               1        11         0  12
##   In_Frame_Del                  2         7         0   9
##   In_Frame_Ins                  0         2         0   2
##   Missense_Mutation            81       183         6 270
##   Nonsense_Mutation             4        54         0  58
##   Nonstop_Mutation              0         0         0   0
##   Silent                        1         6         1   8
##   Splice_Site                   6        19         1  26
##   Translation_Start_Site        0         0         0   0
##   RNA                           0         0         0   0
##   Sum                         101       323         8 432

To go further, how many patients have mutations in TP53 for each cancer type?

fraction_mutated_region = function(y, region) {
    s = subsetByOverlaps(y, region)
    m = length(unique(s$Patient_ID)) / metadata(s)$Number_Patients
    return(m)
}

mutated_fraction = sapply(three_studies, fraction_mutated_region, tp53_region)

mutated_fraction = data.frame(name = names(three_studies), fraction =
mutated_fraction)

library(ggplot2)

p = ggplot(mutated_fraction) + ggplot2::geom_bar(aes(x = name, y = fraction,
fill = name), stat = "identity") + ylim(0, 1) + xlab("Study") + ylab("Ratio") +
theme_bw()

print(p)

5 Data Provenance

5.1 TCGA Data

When importing the mutation data from the TCGA servers, we checked the data for consistency and fix common ambiguities in the annotation.

5.1.1 Processing

Selection of the most recent somatic variant calls for each study. These were stored as *.maf files in the TCGA data directory³. If both manually curated and automatically generated variant calls were available, the curated version was chosen.
Importing of the *.maf files into and checking for consistency with the TCGA MAF specifications⁴. Please note that these guidelines are currently only suggestions and most TCGA files violate some of these.
Transformation of the imported variants into a GRanges object, with one row for each reported variant. Only columns related to the genomic origin of the somatic variant were stored, additional columns describing higher-level effects, such as mutational consequences and alterations at the protein level, were dropped. The seqlevels information defining the chromosomal ranges were taken from the 1000genomes phase 2 reference assembly⁵.
The patient barcode was extracted from the sample barcode.
Metadata describing the design and analysis of the study was extracted.
The processed variants were written to disk, with one file for each study. The metadata for all studies were stored as a single, separate object.

5.1.2 Selection Criteria of Data Sets

We included data sets in the package that were

conducted by the Broad Institute.
cleared for unrestricted access and usage⁶.
sequenced with Illumina platforms.

5.1.3 Consistency Check

According to the TCGA specifications for the MAF files, we screened and corrected for common artifacts in the data regarding annotation. This included:

Transfering of all genomic coordinates to the NCBI 37 reference notation (with the chromosome always depicted as 'MT')
Checking of the entries against all allowed values for this field (currently for the columns Hugo_Symbol, Chromosome, Strand, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele1, Tumor_Seq_Allele2, Verification_Status, Validation_Status, Sequencer).

6 Alternatives

The TCGA data sets can be accessed in different ways. First, the TCGA itself offers access to certain types of its collected data⁷. Another approach has been taken by the cBioPortal for Cancer Genomics⁸ which has performed high-level analyses of several TCGA data sources, such as gene expression and copy number changes. This summarized data can be queried through an interface⁹.

7 Session Info

## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggbio_1.55.0                    ggplot2_3.5.1                  
## [3] GenomicRanges_1.59.0            GenomeInfoDb_1.43.0            
## [5] IRanges_2.41.0                  S4Vectors_0.45.0               
## [7] BiocGenerics_0.53.1             generics_0.1.3                 
## [9] SomaticCancerAlterations_1.43.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
##   [3] jsonlite_1.8.9              magrittr_2.0.3             
##   [5] GenomicFeatures_1.59.0      farver_2.1.2               
##   [7] rmarkdown_2.29              BiocIO_1.17.0              
##   [9] zlibbioc_1.53.0             vctrs_0.6.5                
##  [11] memoise_2.0.1               Rsamtools_2.23.0           
##  [13] RCurl_1.98-1.16             base64enc_0.1-3            
##  [15] htmltools_0.5.8.1           S4Arrays_1.7.1             
##  [17] progress_1.2.3              curl_5.2.3                 
##  [19] SparseArray_1.7.0           Formula_1.2-5              
##  [21] htmlwidgets_1.6.4           plyr_1.8.9                 
##  [23] httr2_1.0.6                 cachem_1.1.0               
##  [25] GenomicAlignments_1.43.0    lifecycle_1.0.4            
##  [27] pkgconfig_2.0.3             Matrix_1.7-1               
##  [29] R6_2.5.1                    fastmap_1.2.0              
##  [31] GenomeInfoDbData_1.2.13     MatrixGenerics_1.19.0      
##  [33] digest_0.6.37               colorspace_2.1-1           
##  [35] GGally_2.2.1                AnnotationDbi_1.69.0       
##  [37] OrganismDbi_1.49.0          Hmisc_5.2-0                
##  [39] RSQLite_2.3.7               labeling_0.4.3             
##  [41] filelock_1.0.3              fansi_1.0.6                
##  [43] httr_1.4.7                  abind_1.4-8                
##  [45] compiler_4.5.0              bit64_4.5.2                
##  [47] withr_3.0.2                 htmlTable_2.4.3            
##  [49] backports_1.5.0             BiocParallel_1.41.0        
##  [51] DBI_1.2.3                   ggstats_0.7.0              
##  [53] highr_0.11                  biomaRt_2.63.0             
##  [55] rappdirs_0.3.3              DelayedArray_0.33.1        
##  [57] rjson_0.2.23                tools_4.5.0                
##  [59] foreign_0.8-87              nnet_7.3-19                
##  [61] glue_1.8.0                  restfulr_0.0.15            
##  [63] grid_4.5.0                  checkmate_2.3.2            
##  [65] cluster_2.1.6               reshape2_1.4.4             
##  [67] gtable_0.3.6                BSgenome_1.75.0            
##  [69] tidyr_1.3.1                 ensembldb_2.31.0           
##  [71] data.table_1.16.2           hms_1.1.3                  
##  [73] xml2_1.3.6                  utf8_1.2.4                 
##  [75] XVector_0.47.0              pillar_1.9.0               
##  [77] stringr_1.5.1               dplyr_1.1.4                
##  [79] BiocFileCache_2.15.0        lattice_0.22-6             
##  [81] rtracklayer_1.67.0          bit_4.5.0                  
##  [83] biovizBase_1.55.0           RBGL_1.83.0                
##  [85] tidyselect_1.2.1            Biostrings_2.75.0          
##  [87] knitr_1.48                  gridExtra_2.3              
##  [89] ProtGenerics_1.39.0         SummarizedExperiment_1.37.0
##  [91] xfun_0.49                   Biobase_2.67.0             
##  [93] matrixStats_1.4.1           stringi_1.8.4              
##  [95] UCSC.utils_1.3.0            lazyeval_0.2.2             
##  [97] yaml_2.3.10                 evaluate_1.0.1             
##  [99] codetools_0.2-20            tibble_3.2.1               
## [101] graph_1.85.0                BiocManager_1.30.25        
## [103] cli_3.6.3                   rpart_4.1.23               
## [105] munsell_0.5.1               dichromat_2.0-0.1          
## [107] Rcpp_1.0.13-1               dbplyr_2.5.0               
## [109] png_0.1-8                   XML_3.99-0.17              
## [111] parallel_4.5.0              blob_1.2.4                 
## [113] prettyunits_1.2.0           AnnotationFilter_1.31.0    
## [115] bitops_1.0-9                txdbmaker_1.3.0            
## [117] VariantAnnotation_1.53.0    scales_1.3.0               
## [119] purrr_1.0.2                 crayon_1.5.3               
## [121] rlang_1.1.4                 KEGGREST_1.47.0