C. ClinVar Integration

Original version: 1 May, 2024

library(AlphaMissenseR)

Introduction

ClinVar is a freely available, public archive of human genetic variants that provides clinical classifications of whether a variant is likely benign or pathogenic. The AlphaMissense publication uses ClinVar data to evaluate and calibrate the model's predictions. A table containing ClinVar information for 82872 variants across 7951 proteins was derived from the supplemental data of the AlphaMissense paper and is made available through this package for benchmarking and visualization.

Access ClinVar classifications with AlphaMissense predictions

Use clinvar_data() to retrieve the ClinVar table from the database.

clinvar_data()
#> * [16:30:25][info] downloading or finding local file
#> * [16:30:25][info] creating database table 'clinvar'
#> * [16:30:25][info] disconnecting all registered connections
#> # Source:   table<clinvar> [?? x 5]
#> # Database: DuckDB v1.1.1 [biocbuild@Linux 6.8.0-47-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/2f7e82265d846_2f7e82265d846]
#>    variant_id           transcript_id     protein_variant AlphaMissense label 
#>    <chr>                <chr>             <chr>                   <dbl> <fct> 
#>  1 chr1_925969_C_T_hg38 ENST00000342066.8 Q96NU1:P10S            0.967  benign
#>  2 chr1_930165_G_A_hg38 ENST00000342066.8 Q96NU1:R28Q            0.663  benign
#>  3 chr1_930204_G_A_hg38 ENST00000342066.8 Q96NU1:R41Q            0.0866 benign
#>  4 chr1_930245_G_A_hg38 ENST00000342066.8 Q96NU1:D55N            0.134  benign
#>  5 chr1_930248_G_A_hg38 ENST00000342066.8 Q96NU1:G56S            0.100  benign
#>  6 chr1_930282_G_A_hg38 ENST00000342066.8 Q96NU1:R67Q            0.0635 benign
#>  7 chr1_930285_G_A_hg38 ENST00000342066.8 Q96NU1:R68Q            0.0629 benign
#>  8 chr1_930314_C_T_hg38 ENST00000342066.8 Q96NU1:H78Y            0.110  benign
#>  9 chr1_930320_C_T_hg38 ENST00000342066.8 Q96NU1:R80C            0.0918 benign
#> 10 chr1_931058_G_A_hg38 ENST00000342066.8 Q96NU1:V92M            0.196  benign
#> # ℹ more rows

The ClinVar table is now available for exploration or parsing.
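As one example of parsing, the protein_variant column packs the UniProt accession, the reference residue, the position, and the alternate residue into a single string such as Q96NU1:P10S. The sketch below splits that string with base R. It runs on three rows copied from the output above so that it needs no database connection; the new column names (uniprot_id, ref_aa, position, alt_aa) are illustrative choices, not names provided by the package.

```r
# Three rows copied from the clinvar_data() output shown above
clinvar_head <- data.frame(
    protein_variant = c("Q96NU1:P10S", "Q96NU1:R28Q", "Q96NU1:R41Q"),
    AlphaMissense = c(0.967, 0.663, 0.0866),
    label = c("benign", "benign", "benign")
)

# Pattern: accession, colon, reference residue, position, alternate residue
pattern <- "^([^:]+):([A-Z])([0-9]+)([A-Z])$"

# Extract each captured group into its own (hypothetically named) column
parsed <- transform(
    clinvar_head,
    uniprot_id = sub(pattern, "\\1", protein_variant),
    ref_aa     = sub(pattern, "\\2", protein_variant),
    position   = as.integer(sub(pattern, "\\3", protein_variant)),
    alt_aa     = sub(pattern, "\\4", protein_variant)
)
parsed
```

To apply the same steps to the full table, one option is to bring it into memory first, for example with dplyr::collect() on the result of clinvar_data().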

Compare ClinVar and AlphaMissense

This section uses the clinvar_plot() function to generate a scatterplot that benchmarks ClinVar classifications against AlphaMissense predictions. By default, the function takes one UniProt accession identifier, derives AlphaMissense scores from am_data("aa_substitution"), and pulls ClinVar classifications from the default ClinVar table shown above. Alternatively, a custom AlphaMissense or ClinVar table can be passed to the function. See the function documentation for details.

clinvar_plot(uniprotId = "P37023")
#> * [16:30:26][info] 'alphamissense_table' not provided, using default 'am_data("aa_substitution")' table accessed through the AlphaMissenseR package
#> * [16:30:28][info] 'clinvar_table' not provided, using default ClinVar dataset in AlphaMissenseR package

The function returns a ggplot object that overlays ClinVar classifications onto AlphaMissense predicted scores. Blue, gray, and red represent the pathogenicity classifications “likely benign”, “ambiguous”, and “likely pathogenic”, respectively. Large, bold points are ClinVar variants colored according to their clinical classification, while smaller points in the background are AlphaMissense predictions.

The plot shows a discrepancy between the clinically validated annotations and the AlphaMissense predictions around position 50: AlphaMissense predicts several variants in that region as likely benign, while ClinVar classifies them as pathogenic.
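Disagreements of this kind can also be counted rather than read off the plot. The sketch below bins AlphaMissense scores into the three classes and cross-tabulates them against the ClinVar label. It assumes the cutoffs of 0.34 and 0.564 reported in the AlphaMissense publication; these values and the handling of scores exactly on a boundary are assumptions here, not settings taken from this package. The ten input rows are the Q96NU1 rows from the table output earlier in this vignette, used only to keep the example self-contained; they are not the P37023 variants in the plot.

```r
# Scores and labels copied from the ten rows of clinvar_data() output above
scores <- c(0.967, 0.663, 0.0866, 0.134, 0.100,
            0.0635, 0.0629, 0.110, 0.0918, 0.196)
labels <- rep("benign", length(scores))

# Assumed class cutoffs (taken from the AlphaMissense publication)
breaks <- c(0, 0.34, 0.564, 1)
classes <- c("likely_benign", "ambiguous", "likely_pathogenic")

# Bin scores into [0, 0.34), [0.34, 0.564), [0.564, 1]
predicted <- cut(scores, breaks = breaks, labels = classes,
                 right = FALSE, include.lowest = TRUE)

# Rows: ClinVar label; columns: AlphaMissense class
agreement <- table(clinvar = labels, predicted = predicted)
agreement
```

Off-diagonal cells, such as ClinVar-benign variants binned as likely pathogenic, are the candidates for closer inspection.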

Because the ClinVar dataset is not exhaustive (not all proteins have been clinically assessed), information may be unavailable for some proteins. In that case, the function raises an error.
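When plotting many proteins in a loop, it can be convenient to catch that error and move on. The following is a minimal sketch using base R's tryCatch(). The helper plot_or_null() and the stand-in failing_plot() are hypothetical names introduced for illustration, and the error text is made up; the stand-in exists only so the example runs without a database connection.

```r
# Call a plotting function; on error, report the message and return NULL
plot_or_null <- function(plot_fun, ...) {
    tryCatch(
        plot_fun(...),
        error = function(e) {
            message("No plot produced: ", conditionMessage(e))
            NULL
        }
    )
}

# Stand-in that always fails (NOT the package function; message is invented)
failing_plot <- function(uniprotId) {
    stop("no ClinVar information for ", uniprotId)
}

# "X00000" is a placeholder accession used only for this demonstration
result <- plot_or_null(failing_plot, uniprotId = "X00000")
is.null(result)
```

With the package loaded, the same wrapper could be called as plot_or_null(clinvar_plot, uniprotId = "P37023"), returning the ggplot object when the protein is present.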

Remember to disconnect from the database.

db_disconnect_all()
#> * [16:30:31][info] disconnecting all registered connections

Session information

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] ProteinGymR_0.99.7   gghalves_0.1.4       ggplot2_3.5.1       
#>  [4] ggdist_3.3.2         tidyr_1.3.1          ExperimentHub_2.15.0
#>  [7] AnnotationHub_3.15.0 BiocFileCache_2.15.0 dbplyr_2.5.0        
#> [10] BiocGenerics_0.53.0  AlphaMissenseR_1.3.0 dplyr_1.1.4         
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        viridisLite_0.4.2       farver_2.1.2           
#>  [4] blob_1.2.4              filelock_1.0.3          Biostrings_2.75.0      
#>  [7] fastmap_1.2.0           duckdb_1.1.1            promises_1.3.0         
#> [10] digest_0.6.37           mime_0.12               lifecycle_1.0.4        
#> [13] r3dmol_0.1.2            KEGGREST_1.47.0         RSQLite_2.3.7          
#> [16] magrittr_2.0.3          compiler_4.5.0          rlang_1.1.4            
#> [19] sass_0.4.9              tools_4.5.0             utf8_1.2.4             
#> [22] yaml_2.3.10             knitr_1.48              labeling_0.4.3         
#> [25] htmlwidgets_1.6.4       bit_4.5.0               spdl_0.0.5             
#> [28] curl_5.2.3              withr_3.0.2             purrr_1.0.2            
#> [31] grid_4.5.0              stats4_4.5.0            fansi_1.0.6            
#> [34] xtable_1.8-4            colorspace_2.1-1        scales_1.3.0           
#> [37] cli_3.6.3               rmarkdown_2.28          crayon_1.5.3           
#> [40] generics_0.1.3          httr_1.4.7              BiocBaseUtils_1.9.0    
#> [43] DBI_1.2.3               cachem_1.1.0            zlibbioc_1.53.0        
#> [46] parallel_4.5.0          AnnotationDbi_1.69.0    BiocManager_1.30.25    
#> [49] XVector_0.47.0          vctrs_0.6.5             jsonlite_1.8.9         
#> [52] IRanges_2.41.0          S4Vectors_0.45.0        bit64_4.5.2            
#> [55] jquerylib_0.1.4         bio3d_2.4-4             glue_1.8.0             
#> [58] distributional_0.5.0    gtable_0.3.6            BiocVersion_3.21.1     
#> [61] later_1.3.2             GenomeInfoDb_1.43.0     GenomicRanges_1.59.0   
#> [64] UCSC.utils_1.3.0        munsell_0.5.1           tibble_3.2.1           
#> [67] pillar_1.9.0            rappdirs_0.3.3          htmltools_0.5.8.1      
#> [70] GenomeInfoDbData_1.2.13 R6_2.5.1                shiny.gosling_1.3.0    
#> [73] queryup_1.0.5           evaluate_1.0.1          shiny_1.9.1            
#> [76] Biobase_2.67.0          highr_0.11              png_0.1-8              
#> [79] memoise_2.0.1           httpuv_1.6.15           bslib_0.8.0            
#> [82] RcppSpdlog_0.0.18       rjsoncons_1.3.1         Rcpp_1.0.13            
#> [85] whisker_0.4.1           xfun_0.48               forcats_1.0.0          
#> [88] pkgconfig_2.0.3