Resource specific interaction attributes

Denes Turei¹*

¹ Institute for Computational Biomedicine, Heidelberg University

23 October 2022

Abstract

OmniPath provides a broad variety of protein annotations, but for interactions, until recently, only a standard set of essential attributes (direction, effect, etc.) and a handful of others (e.g. DoRothEA confidence level) were available. The newly introduced extra_attrs column contains custom, resource-specific attributes from network databases, encoded as JSON. We also revised the processing of these resources to ensure that we include as many useful attributes as possible. In the OmnipathR package we added a few new functions to support the processing of the JSON-encoded column: to scan it for keys and values, and to extract specific variables of interest into new columns. We give a brief overview of these here.

Package

OmnipathR 3.5.25

1 Loading a network
2 Which extra attributes are available?
3 Inspecting one attribute
4 Converting extra attributes to columns
5 Filtering records based on extra attributes
6 Example: finding ubiquitination interactions
7 Session information

1 Loading a network

library(OmnipathR)

First we retrieve the complete directed PPI network. Importantly, the extra attributes are only included if the fields = "extra_attrs" argument is provided.

i <- import_post_translational_interactions(fields = 'extra_attrs')
dplyr::select(i, source_genesymbol, target_genesymbol, extra_attrs)

## # A tibble: 80,237 × 3
##    source_genesymbol target_genesymbol extra_attrs     
##    <chr>             <chr>             <list>          
##  1 CALM3             TRPC1             <named list [1]>
##  2 CALM1             TRPC1             <named list [1]>
##  3 CALM2             TRPC1             <named list [1]>
##  4 CAV1              TRPC1             <named list [1]>
##  5 DRD2              TRPC1             <named list [1]>
##  6 MDFI              TRPC1             <named list [1]>
##  7 ITPR2             TRPC1             <named list [1]>
##  8 MARCKS            TRPC1             <named list [1]>
##  9 TRPC1             GRM1              <named list [0]>
## 10 GRM1              TRPC1             <named list [1]>
## # … with 80,227 more rows

As shown above, extra_attrs is a list column. Each of its elements is itself a nested list that holds the extra attributes from all resources, as extracted from the JSON.
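
To make the shape of such a column concrete, here is a minimal base R sketch. The records and values are invented for illustration and are not taken from OmniPath; only the layout (one named list per row, keyed by resource and attribute name) follows the description above.

```r
# Toy data frame with a list column, mimicking the layout of extra_attrs.
# All records and values below are made up for illustration.
toy <- data.frame(
    source_genesymbol = c('A', 'B', 'C'),
    target_genesymbol = c('X', 'Y', 'Z'),
    stringsAsFactors = FALSE
)
toy$extra_attrs <- list(
    list(SIGNOR_mechanism = c('phosphorylation', 'Phosphorylation')),
    list(),
    list(SIGNOR_mechanism = 'binding', SPIKE_effect = 1)
)

# Each cell is itself a list; its names are the attribute keys
names(toy$extra_attrs[[3]])

# Individual values are reached by indexing the row, then the key
toy$extra_attrs[[1]]$SIGNOR_mechanism
```

The second record has an empty list, which corresponds to the `<named list [0]>` entries in the printed tibble.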

2 Which extra attributes are available?

Which attributes are present in the network depends only on the interactions it contains: if none of the interactions comes from the SPIKE database, SPIKE_mechanism will not be present. The names of the extra attributes consist of the name of the resource and the name of the attribute, separated by an underscore. The resource name never contains an underscore, while some attribute names do. To list the extra attributes available in a particular data frame, use the extra_attrs function:
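
Because the resource name never contains an underscore, a key can be split unambiguously at its first underscore. The base R sketch below shows one way to do this; the variable names are our own and the keys are copied by hand for the example.

```r
# Split extra attribute names at the first underscore:
# the part before it is the resource, the rest is the attribute name.
keys <- c('PhosphoSite_noref_evidence', 'HPRD-phos_mechanism', 'ARN_is_direct')

# Drop everything from the first underscore onwards
resource <- sub('_.*$', '', keys)
# Drop everything up to and including the first underscore
attribute <- sub('^[^_]*_', '', keys)

data.frame(key = keys, resource = resource, attribute = attribute)
```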

extra_attrs(i)

##  [1] "TRIP_method"                "SIGNOR_mechanism"           "PhosphoSite_noref_evidence"
##  [4] "PhosphoPoint_category"      "PhosphoSite_evidence"       "HPRD-phos_mechanism"       
##  [7] "Li2012_mechanism"           "Li2012_route"               "SPIKE_effect"              
## [10] "SPIKE_mechanism"            "CA1_effect"                 "CA1_type"                  
## [13] "Macrophage_type"            "Macrophage_location"        "ACSN_effect"               
## [16] "Cellinker_type"             "CellChatDB_category"        "talklr_putative"           
## [19] "CellPhoneDB_type"           "Ramilowski2015_source"      "ARN_effect"                
## [22] "ARN_is_direct"              "ARN_is_directed"            "NRF2ome_effect"            
## [25] "NRF2ome_is_direct"          "NRF2ome_is_directed"

The labels listed here are the top level keys in the lists in the extra_attrs column. Note that the coverage of these variables varies widely, typically in proportion to the size of the resource.
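
Coverage can be estimated by counting how many records carry each top level key. The following base R sketch does this on an invented list of attribute lists; with real data, the extra_attrs column would take the place of the toy `attrs` object.

```r
# Invented stand-in for an extra_attrs column: one named list per record
attrs <- list(
    list(SIGNOR_mechanism = 'binding'),
    list(),
    list(SIGNOR_mechanism = 'phosphorylation', SPIKE_effect = 1),
    list(SPIKE_effect = -1, SPIKE_mechanism = 'Ubiquitination')
)

# Collect the top level keys of every record
all_keys <- unlist(lapply(attrs, names))

# Unique keys, and the number of records carrying each of them
keys <- unique(all_keys)
coverage <- sort(table(all_keys), decreasing = TRUE)
coverage
```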

3 Inspecting one attribute

The values of each extra attribute, in theory, can be arbitrarily complex nested lists, but in reality, these are most often simple numeric, logical or character values or vectors. To see the unique values of one attribute use the extra_attr_values function. Let’s see the values of the SIGNOR_mechanism attribute:

extra_attr_values(i, SIGNOR_mechanism)

##  [1] "phosphorylation"                    "binding"                           
##  [3] "dephosphorylation"                  "Phosphorylation"                   
##  [5] "ubiquitination"                     "N/A"                               
##  [7] "Physical Interaction"               "cleavage"                          
##  [9] "Proteolytic Processing"             "deubiquitination"                  
## [11] "Deubiqitination"                    "relocalization"                    
## [13] "Ubiquitination"                     "Dephosphorylation"                 
## [15] "Other"                              "guanine nucleotide exchange factor"
## [17] "Transcription Regulation"           "gtpase-activating protein"         
## [19] "Indirect"                           ""                                  
## [21] "Sumoylation"                        "sumoylation"                       
## [23] "palmitoylation"                     "demethylation"                     
## [25] "Demethylation"                      "mRNA stability"                    
## [27] "methylation"                        "Methylation"                       
## [29] "hydroxylation"                      "Acetylation"                       
## [31] "acetylation"                        "deacetylation"                     
## [33] "Deacetylation"                      "Translational Regulation"          
## [35] "Protein Degradation"                "s-nitrosylation"                   
## [37] "phosphomotif_binding"               "chemical activation"               
## [39] "Proteolytic Cleavage"               "glycosylation"                     
## [41] "post transcriptional regulation"    "catalytic activity"                
## [43] "neddylation"                        "Neddylation"                       
## [45] "tyrosination"                       "lipidation"                        
## [47] "ADP-ribosylation"                   "desumoylation"                     
## [49] "isomerization"                      "post translational modification"   
## [51] "carboxylation"                      "Alkylation"                        
## [53] "chemical inhibition"                "oxidation"                         
## [55] "translation regulation"             "Carboxylation"                     
## [57] "destabilization"

The values are provided as they are in the original resource, including potential typos and inconsistencies: for example, several values above appear in both capitalized and lowercase forms, and “Deubiqitination” is misspelled.
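
If case differences are irrelevant for an analysis, the values can be normalized before use. The base R sketch below lowercases a hand-picked subset of the values above; it is only an illustration, and it does not repair misspellings.

```r
# A few of the SIGNOR_mechanism values listed above, copied by hand
values <- c(
    'phosphorylation', 'Phosphorylation',
    'ubiquitination', 'Ubiquitination',
    'deubiquitination', 'Deubiqitination'
)

# Lowercasing merges the capitalized and lowercase variants
normalized <- unique(tolower(values))
normalized
```

The misspelled “deubiqitination” remains a separate value after lowercasing, so typos still need manual mapping.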

4 Converting extra attributes to columns

To make use of the attributes, it is convenient to extract the interesting ones into separate columns of the data frame. With the extra_attrs_to_cols function multiple attributes can be converted in a single call. Custom column names can be passed by argument names. As an example, let’s extract two attributes:

i0 <- extra_attrs_to_cols(
    i,
    si_mechanism = SIGNOR_mechanism,
    ma_mechanism = Macrophage_type,
    keep_empty = FALSE
)

dplyr::select(
    i0,
    source_genesymbol,
    target_genesymbol,
    si_mechanism,
    ma_mechanism
)

## # A tibble: 11,638 × 4
##    source_genesymbol target_genesymbol si_mechanism ma_mechanism
##    <chr>             <chr>             <list>       <list>      
##  1 PRKG1             TRPC3             <chr [1]>    <NULL>      
##  2 PRKG1             TRPC7             <chr [1]>    <NULL>      
##  3 OS9               TRPV4             <chr [1]>    <NULL>      
##  4 PTPN1             TRPV6             <chr [1]>    <NULL>      
##  5 RACK1             TRPM6             <chr [1]>    <NULL>      
##  6 PRKACA            MCOLN1            <chr [1]>    <NULL>      
##  7 MAPK14            MAPKAPK2          <chr [2]>    <chr [2]>   
##  8 MAPKAPK2          HNRNPA0           <chr [2]>    <NULL>      
##  9 MAPKAPK2          PARN              <chr [2]>    <NULL>      
## 10 JAK2              EPOR              <chr [2]>    <NULL>      
## # … with 11,628 more rows

Above we disabled the keep_empty option; otherwise the new columns would have NULL values for most of the records, simply because out of the 80 thousand interactions in the data frame fewer than 12 thousand carry attributes from SIGNOR or Macrophage. The new columns are list columns, and the individual values are character vectors. Let’s look into one value:

dplyr::pull(i0, si_mechanism)[[7]]

## [1] "phosphorylation" "Phosphorylation"

Here we have two values, but only because of the inconsistent capitalization in the resource.

Depending on downstream methods, atomic columns might be preferable to lists. In this case one interaction record might yield multiple rows in the resulting data frame, depending on the number of values its attributes have. To have atomic columns, use the flatten option:

i1 <- extra_attrs_to_cols(
    i,
    si_mechanism = SIGNOR_mechanism,
    ma_mechanism = Macrophage_type,
    keep_empty = FALSE,
    flatten = TRUE
)

dplyr::select(
    i1,
    source_genesymbol,
    target_genesymbol,
    si_mechanism,
    ma_mechanism
)

## # A tibble: 13,409 × 4
##    source_genesymbol target_genesymbol si_mechanism      ma_mechanism   
##    <chr>             <chr>             <chr>             <chr>          
##  1 PRKG1             TRPC3             phosphorylation   <NA>           
##  2 PRKG1             TRPC7             phosphorylation   <NA>           
##  3 OS9               TRPV4             binding           <NA>           
##  4 PTPN1             TRPV6             dephosphorylation <NA>           
##  5 RACK1             TRPM6             binding           <NA>           
##  6 PRKACA            MCOLN1            phosphorylation   <NA>           
##  7 MAPK14            MAPKAPK2          phosphorylation   Phosphorylation
##  8 MAPK14            MAPKAPK2          phosphorylation   Phosphorylation
##  9 MAPK14            MAPKAPK2          Phosphorylation   Phosphorylation
## 10 MAPK14            MAPKAPK2          Phosphorylation   Phosphorylation
## # … with 13,399 more rows
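
Why one record can become several rows is easiest to see on a toy case. The base R sketch below assumes that flattening emits one row per combination of values across the extracted columns; it uses invented values and is not the implementation of extra_attrs_to_cols.

```r
# Invented values for a single interaction record: two list-type cells
si_values <- c('a', 'b')
ma_values <- c('x', 'y')

# One row for every combination of the values (assumed behaviour)
flat <- expand.grid(
    si_mechanism = si_values,
    ma_mechanism = ma_values,
    stringsAsFactors = FALSE
)
flat
nrow(flat)
```

Under this assumption, a record with two values in each of two columns expands to four rows, which is worth keeping in mind when counting interactions after flattening.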

5 Filtering records based on extra attributes

Another useful application of extra attributes is filtering the records of the interactions data frame. The with_extra_attrs function filters to records which have certain extra attributes. For example, to have only interactions with SIGNOR_mechanism given:

nrow(with_extra_attrs(i, SIGNOR_mechanism))

## [1] 11340

This leaves around 11 thousand rows. When filtering for multiple attributes, the records which have at least one of them are selected. Adding more attributes therefore yields more interactions:

nrow(with_extra_attrs(i, SIGNOR_mechanism, CA1_effect, Li2012_mechanism))

## [1] 12247
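
The “at least one of them” rule can be sketched in base R as follows. The helper `has_any` and the toy records are hypothetical and serve only to illustrate the selection logic described above.

```r
# Invented stand-in for an extra_attrs column
attrs <- list(
    list(SIGNOR_mechanism = 'binding'),
    list(CA1_effect = '+'),
    list(TRIP_method = 'Y2H'),
    list()
)

# Hypothetical helper: TRUE if the record has at least one of the keys
has_any <- function(a, keys) any(keys %in% names(a))

keep_one <- vapply(attrs, has_any, logical(1), keys = 'SIGNOR_mechanism')
keep_two <- vapply(
    attrs, has_any, logical(1),
    keys = c('SIGNOR_mechanism', 'CA1_effect')
)

# Adding keys can only keep the same number of records or more
sum(keep_one)
sum(keep_two)
```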

It is possible to filter the records not only by the names but also by the values of the extra attributes. Let’s select the interactions which are phosphorylation according to SIGNOR:

phos <- c('phosphorylation', 'Phosphorylation')

si_phos <- filter_extra_attrs(i, SIGNOR_mechanism = phos)

dplyr::select(si_phos, source_genesymbol, target_genesymbol)

## # A tibble: 4,255 × 2
##    source_genesymbol target_genesymbol
##    <chr>             <chr>            
##  1 PRKG1             TRPC3            
##  2 PRKG1             TRPC7            
##  3 PRKACA            MCOLN1           
##  4 MAPK14            MAPKAPK2         
##  5 MAPKAPK2          HNRNPA0          
##  6 MAPKAPK2          PARN             
##  7 JAK2              EPOR             
##  8 MAPK14            ZFP36            
##  9 MAPKAPK2          ZFP36            
## 10 AKT1              CHUK             
## # … with 4,245 more rows
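
Filtering by value amounts to checking whether any value stored under a key is among the accepted ones. A minimal base R sketch on invented records, assuming this “any value matches” rule:

```r
# Invented stand-in for an extra_attrs column
attrs <- list(
    list(SIGNOR_mechanism = c('phosphorylation', 'Phosphorylation')),
    list(SIGNOR_mechanism = 'binding'),
    list(SPIKE_mechanism = 'Phosphorylation'),
    list()
)

phos <- c('phosphorylation', 'Phosphorylation')

# A record matches if any of its SIGNOR_mechanism values is in `phos`;
# records without the key yield NULL and hence do not match
matches <- vapply(
    attrs,
    function(a) any(a[['SIGNOR_mechanism']] %in% phos),
    logical(1)
)
which(matches)
```

Only the first record matches: the third one has a matching value, but under a different key.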

6 Example: finding ubiquitination interactions

First let’s search for the word “ubiquitination” in the attributes. Below is a slow but simple solution:

keys <- extra_attrs(i)
keys_ubi <- purrr::keep(
    keys,
    function(k){
        any(stringr::str_detect(extra_attr_values(i, !!k), 'biqu'))
    }
)
keys_ubi

## [1] "SIGNOR_mechanism"    "HPRD-phos_mechanism" "SPIKE_mechanism"     "CA1_type"           
## [5] "Macrophage_type"

We found five attributes that have at least one value which matches “biqu”. Next take a look at their values:

ubi <- rlang::set_names(
    purrr::map(
        keys_ubi,
        function(k){
            stringr::str_subset(extra_attr_values(i, !!k), 'biqu')
        }
    ),
    keys_ubi
)
ubi

## $SIGNOR_mechanism
## [1] "ubiquitination"   "deubiquitination" "Ubiquitination"  
## 
## $`HPRD-phos_mechanism`
## [1] "Ubiquitination"
## 
## $SPIKE_mechanism
## [1] "Ubiquitination"     "Polyubiquitination"
## 
## $CA1_type
## [1] "Ubiquitination"
## 
## $Macrophage_type
## [1] "Ubiquitination"

To match all ubiquitination interactions, it is enough to filter for “ubiquitination” in its lowercase and capitalized forms (note that we could also include deubiquitination and polyubiquitination):

ubi_kws <- c('ubiquitination', 'Ubiquitination')

i_ubi <-
    dplyr::distinct(
        dplyr::bind_rows(
            purrr::map(
                keys_ubi,
                function(k){
                    filter_extra_attrs(i, !!k := ubi_kws, na_ok = FALSE)
                }
            )
        )
    )

dplyr::select(i_ubi, source_genesymbol, target_genesymbol)

## # A tibble: 405 × 2
##    source_genesymbol target_genesymbol
##    <chr>             <chr>            
##  1 NUMB              NOTCH1           
##  2 PRKN              RANBP2           
##  3 PRKN              SNCA             
##  4 FBXW7             MYC              
##  5 UBE2T             FANCL            
##  6 BIRC2             TRAF2            
##  7 TRAF2             MAP3K14          
##  8 TRAF6             MAP3K7           
##  9 XIAP              DIABLO           
## 10 TRAF2             RIPK1            
## # … with 395 more rows

We found 405 ubiquitination interactions. We had to use map, bind_rows and distinct because otherwise filter_extra_attrs would return the intersection of the matches, instead of their union.
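
The difference between the two ways of combining matches can be shown with plain set operations. In the base R sketch below, the row indices matched by each key are invented:

```r
# Invented row indices matched by each attribute key
hits <- list(
    SIGNOR_mechanism = c(1, 2, 5),
    SPIKE_mechanism = c(2, 3),
    CA1_type = c(2, 4)
)

# Union: rows matching under at least one key (what we want here)
in_any <- Reduce(union, hits)
# Intersection: rows matching under every key at once
in_all <- Reduce(intersect, hits)

length(in_any)
length(in_all)
```

Since few interactions are annotated by several resources at once, the intersection is typically much smaller than the union.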

In this data frame we have 150 unique ubiquitin E3 ligases:

length(unique(i_ubi$source_genesymbol))

## [1] 150

UniProt annotates E3 ligases by the “Ubl conjugation” keyword. We can check how many of those 150 proteins have this annotation:

uniprot_kws <- import_omnipath_annotations(
    resources = 'UniProt_keyword',
    entity_type = 'protein',
    wide = TRUE
)

e3_ligases <- dplyr::pull(
    dplyr::filter(uniprot_kws, keyword == 'Ubl conjugation'),
    genesymbol
)

length(e3_ligases)

## [1] 2517

length(intersect(unique(i_ubi$source_genesymbol), e3_ligases))

## [1] 84

length(setdiff(unique(i_ubi$source_genesymbol), e3_ligases))

## [1] 66

We retrieved 2517 E3 ligases from UniProt. 84 of these have substrates in the interaction database, while 66 of the effectors of the interactions are not annotated as such in UniProt.

In the OmniPath enzyme-substrate database we collect ubiquitination interactions from enzyme-PTM resources. However, these contain only a small number of interactions:

es_ubi <- import_omnipath_enzsub(types = 'ubiquitination')
es_ubi

## # A tibble: 52 × 12
##    enzyme substrate enzyme_genesymbol substr…¹ resid…² resid…³ modif…⁴ sources refer…⁵ curat…⁶ n_ref…⁷ n_res…⁸
##    <chr>  <chr>     <chr>             <chr>    <chr>     <dbl> <chr>   <chr>   <chr>     <dbl>   <int>   <int>
##  1 Q8IUD6 O95786    RNF135            DDX58    K           907 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  2 Q8IUD6 O95786    RNF135            DDX58    K           909 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  3 Q8IYW5 P16104    RNF168            H2AX     K            14 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  4 Q8NG06 O95786    TRIM58            DDX58    K           172 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  5 Q969H0 P23769    FBXW7             GATA2    T           176 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  6 Q969V5 P31749    MUL1              AKT1     K           284 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  7 Q96J02 Q7Z434    ITCH              MAVS     K           371 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  8 Q96J02 Q7Z434    ITCH              MAVS     K           420 ubiqui… SIGNOR  SIGNOR…       1       1       1
##  9 Q96PU5 P35240    NEDD4L            NF2      K           396 ubiqui… SIGNOR  SIGNOR…       1       1       1
## 10 Q9C0C9 O43541    UBE2O             SMAD6    K           173 ubiqui… SIGNOR  SIGNOR…       1       1       1
## # … with 42 more rows, and abbreviated variable names ¹substrate_genesymbol, ²residue_type, ³residue_offset,
## #   ⁴modification, ⁵references, ⁶curation_effort, ⁷n_references, ⁸n_resources

With only two exceptions, all of these have been recovered by using the extra attributes from the network database:

es_i_ubi <-
    dplyr::inner_join(
        es_ubi,
        i_ubi,
        by = c(
            'enzyme_genesymbol' = 'source_genesymbol',
            'substrate_genesymbol' = 'target_genesymbol'
        )
    )

nrow(dplyr::distinct(dplyr::select(es_i_ubi, enzyme, substrate, residue_offset)))

## [1] 50
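
To see which records are not recovered, the pairs without a partner in the other table can be listed. The base R sketch below does this on two invented tables; the column names mirror those used above, but the data and the key construction are our own illustration.

```r
# Invented enzyme-substrate records and network interactions
es <- data.frame(
    enzyme_genesymbol = c('E1', 'E2', 'E3'),
    substrate_genesymbol = c('S1', 'S2', 'S3'),
    stringsAsFactors = FALSE
)
net <- data.frame(
    source_genesymbol = c('E1', 'E3', 'E4'),
    target_genesymbol = c('S1', 'S3', 'S4'),
    stringsAsFactors = FALSE
)

# Build a comparable key for each pair in both tables
key_es <- paste(es$enzyme_genesymbol, es$substrate_genesymbol, sep = '|')
key_net <- paste(net$source_genesymbol, net$target_genesymbol, sep = '|')

# Enzyme-substrate pairs with no matching network interaction
not_recovered <- es[!(key_es %in% key_net), ]
not_recovered
```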

7 Session information

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB             
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_2.0.3   ggraph_2.1.0     igraph_1.3.5     ggplot2_3.3.6    dplyr_1.0.10     OmnipathR_3.5.25
## [7] BiocStyle_2.25.0
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.2       httr_1.4.4          sass_0.4.2          tidyr_1.2.1         tidygraph_1.2.2    
##  [6] bit64_4.0.5         vroom_1.6.0         jsonlite_1.8.3      viridisLite_0.4.1   bslib_0.4.0        
## [11] assertthat_0.2.1    highr_0.9           BiocManager_1.30.18 selectr_0.4-2       cellranger_1.1.0   
## [16] yaml_2.3.6          progress_1.2.2      ggrepel_0.9.1       pillar_1.8.1        backports_1.4.1    
## [21] glue_1.6.2          digest_0.6.30       polyclip_1.10-4     checkmate_2.1.0     rvest_1.0.3        
## [26] colorspace_2.0-3    htmltools_0.5.3     pkgconfig_2.0.3     logger_0.2.2        magick_2.7.3       
## [31] bookdown_0.29       purrr_0.3.5         scales_1.2.1        tweenr_2.0.2        later_1.3.0        
## [36] tzdb_0.3.0          ggforce_0.4.1       tibble_3.1.8        generics_0.1.3      farver_2.1.1       
## [41] ellipsis_0.3.2      cachem_1.0.6        withr_2.5.0         cli_3.4.1           crayon_1.5.2       
## [46] readxl_1.4.1        evaluate_0.17       fansi_1.0.3         MASS_7.3-58.1       xml2_1.3.3         
## [51] tools_4.2.1         prettyunits_1.1.1   hms_1.1.2           lifecycle_1.0.3     stringr_1.4.1      
## [56] munsell_0.5.0       compiler_4.2.1      jquerylib_0.1.4     rlang_1.0.6         grid_4.2.1         
## [61] rappdirs_0.3.3      labeling_0.4.2      rmarkdown_2.17      gtable_0.3.1        DBI_1.1.3          
## [66] curl_4.3.3          graphlayouts_0.8.3  R6_2.5.1            gridExtra_2.3       knitr_1.40         
## [71] fastmap_1.1.0       bit_4.0.4           utf8_1.2.2          readr_2.1.3         stringi_1.7.8      
## [76] parallel_4.2.1      Rcpp_1.0.9          vctrs_0.5.0         tidyselect_1.2.0    xfun_0.34