1 Introduction

High-throughput omics data are often affected by systematic biases introduced throughout all the steps of a clinical study, from sample collection to quantification. Failure to account for these biases can lead to erroneous results and misleading conclusions in downstream analysis. Normalization methods aim to adjust for these biases to make the actual biological signal more prominent. However, selecting an appropriate normalization method is challenging due to the wide range of available approaches. Therefore, a comparative evaluation of unnormalized and normalized data is essential in identifying an appropriate normalization strategy for a specific data set. This R package provides different functions for preprocessing, normalizing, and evaluating different normalization approaches. Furthermore, normalization methods can be evaluated on downstream steps, such as differential expression analysis and statistical enrichment analysis. Spike-in data sets with known ground truth and real-world data sets of biological experiments acquired by either tandem mass tag (TMT) or label-free quantification (LFQ) can be analyzed.

2 Installation

# Official BioC installation instructions
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PRONE")
# Load and attach PRONE
library(PRONE)

3 Workflow

A six-step workflow was developed in R version 4.2.2 to evaluate the effectiveness of the previously defined normalization methods on proteomics data. The workflow incorporates a set of novel functions and also integrates various methods adopted by state-of-the-art tools.

Six-step workflow of PRONE.

Figure 3.1: Six-step workflow of PRONE.

Following the upload of the proteomics data into a SummarizedExperiment object, proteins with too many missing values can be removed, outlier samples identified, and normalization carried out. Furthermore, an exploratory analysis of the performance of normalization methods can be conducted. Finally, differential expression analysis can be executed to further evaluate the effectiveness of normalization methods. For data sets with known ground truth, such as spike-in and simulated data sets, performance metrics, such as true positives (TPs), false positives (FPs), and area under the curve (AUC) values, can be computed. The evaluation of DE results of real-world experiments is based on visual quality inspection, for instance, using volcano plots, and an intersection analysis of the DE proteins of different normalization methods is available.

4 Usage

This guide serves as an introduction to the PRONE R package, designed to facilitate the preparation of your data set for utilization of the PRONE package’s functionalities. It begins by delineating the underlying data structure essential for the application of the package, followed by a brief description of how to apply different normalization techniques to your data. Additionally, this tutorial shows how to export the normalized data at the end.

Beyond the scope of this introductory tutorial, PRONE encompassess a broad spectrum of functionalities, ranging from preprocessing steps, imputation, normalization and evaluation of the performance of different normalization techniques, to the identification of differentially expressed proteins. These functionalities are detailed in dedicated vignettes, offering detailed insights and instructions for leveraging full capabilities of the PRONE package:

Furthermore, PRONE provides additional functionalities for the analysis of spike-in data sets, which are detailed in the following vignette:

5 Load Data

PRONE uses the SummarizedExperiment class as storage for protein intensities and meta data information on the proteomics data set. Hence, before being able to execute the functionalities of PRONE, the data needs to be saved accordingly. For this, the load_data() function was implemented and requires different parameters which are explained in the following:

If you have a tandem mass tag (TMT) data set with samples being measured in different batches than you have to specify the batch information. If reference samples were included in each batch, then additionally specify the samples names of the reference samples.

Attention: You need to make sure that the sample names are saved in a column named “Column” in the meta-data table and are named accordingly in the protein intensity table.

5.1 Example 1: TMT Data Set

The example TMT data set originates from (Biadglegne et al. 2022).

data_path <- readPRONE_example("tuberculosis_protein_intensities.csv")
md_path <- readPRONE_example("tuberculosis_metadata.csv")

data <- read.csv(data_path)
md <- read.csv(md_path)

md$Column <- stringr::str_replace_all(md$Column, " ", ".")

ref_samples <- md[md$Group == "ref",]$Column

se <- load_data(data, md, protein_column = "Protein.IDs", 
                gene_column = "Gene.names", ref_samples = ref_samples, 
                batch_column = "Pool", condition_column = "Group", 
                label_column = "Label")

5.2 Example 2: Label-free (LFQ) Data Set

The example data set originates from (Vehmas et al. 2016). This data set is used for the subsequent examples in this tutorial.

data_path <- readPRONE_example("mouse_liver_cytochrome_P450_protein_intensities.csv")
md_path <- readPRONE_example("mouse_liver_cytochrome_P450_metadata.csv")

data <- read.csv(data_path, check.names = FALSE)
md <- read.csv(md_path)

se <- load_data(data, md, protein_column = "Accession", 
                gene_column = "Gene names", ref_samples = NULL, 
                batch_column = NULL, condition_column = "Condition", 
                label_column = NULL)

6 Data Structure

The SummarizedExperiment object contains the protein intensities as “assay”, the meta-data table as “colData”, and additional columns for instance resulting from MaxQuant as “rowData”. Furthermore, information on the different columns, for instance, which columns contains the batch information, can be found in the “metadata” slot.

se
#> class: SummarizedExperiment 
#> dim: 1499 12 
#> metadata(4): condition batch refs label
#> assays(2): raw log2
#> rownames(1499): 1 2 ... 1498 1499
#> rowData names(4): Gene.Names Protein.IDs Peptides used for quantitation
#>   IDs
#> colnames(12): 2206_WT 2208_WT ... 2285_Arom 2253_Arom
#> colData names(3): Column Animal Condition

The different data types can be accessed by using the assays() function. Currently, only the raw data and log2-transformed data are stored in the SummarizedExperiment object.

assays(se)
#> List of length 2
#> names(2): raw log2

7 Preprocessing, Imputation, Normalization, Evaluation, and Differential Expression

As already mentioned in the introduction section, many functionalities are available in PRONE. All these functionalities are mainly based on the SummarizedExperiment object.

In this tutorial, we will only perform simple normalization of the data using median and LoessF normalization.

se <- normalize_se(se, c("Median", "LoessF"))
#> Median completed.
#> LoessF completed.

The normalized intensities will be saved as additional assays in the SummarizedExperiment object.

assays(se)
#> List of length 4
#> names(4): raw log2 Median LoessF

Again, more information on the individual processes can be find in dedicated vignettes.

8 Download Data

Finally, you can easily download the normalized data by using the export_data() function. This function will save the specified assays as CSV files and the SummarizedExperiment object as an RDS file in a specified output directory. Make sure that the output directory exists.

if(!dir.exists("output/")) dir.create("output/")

export_data(se, out_dir = "output/", ain = c("log2", "Median", "LoessF"))

9 Session Info

utils::sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] PRONE_1.1.0                 SummarizedExperiment_1.37.0
#>  [3] Biobase_2.67.0              GenomicRanges_1.59.0       
#>  [5] GenomeInfoDb_1.43.0         IRanges_2.41.0             
#>  [7] S4Vectors_0.45.0            BiocGenerics_0.53.0        
#>  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> 
#> loaded via a namespace (and not attached):
#>   [1] rlang_1.1.4                 magrittr_2.0.3             
#>   [3] GetoptLong_1.0.5            clue_0.3-65                
#>   [5] compiler_4.5.0              png_0.1-8                  
#>   [7] vctrs_0.6.5                 reshape2_1.4.4             
#>   [9] stringr_1.5.1               ProtGenerics_1.39.0        
#>  [11] shape_1.4.6.1               pkgconfig_2.0.3            
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] magick_2.8.5                XVector_0.47.0             
#>  [17] labeling_0.4.3              utf8_1.2.4                 
#>  [19] rmarkdown_2.28              UCSC.utils_1.3.0           
#>  [21] preprocessCore_1.69.0       purrr_1.0.2                
#>  [23] xfun_0.48                   MultiAssayExperiment_1.33.0
#>  [25] zlibbioc_1.53.0             cachem_1.1.0               
#>  [27] jsonlite_1.8.9              highr_0.11                 
#>  [29] DelayedArray_0.33.0         BiocParallel_1.41.0        
#>  [31] parallel_4.5.0              cluster_2.1.6              
#>  [33] R6_2.5.1                    RColorBrewer_1.1-3         
#>  [35] bslib_0.8.0                 stringi_1.8.4              
#>  [37] ComplexUpset_1.3.3          limma_3.63.0               
#>  [39] jquerylib_0.1.4             Rcpp_1.0.13                
#>  [41] bookdown_0.41               iterators_1.0.14           
#>  [43] knitr_1.48                  Matrix_1.7-1               
#>  [45] igraph_2.1.1                tidyselect_1.2.1           
#>  [47] abind_1.4-8                 yaml_2.3.10                
#>  [49] doParallel_1.0.17           ggtext_0.1.2               
#>  [51] codetools_0.2-20            affy_1.85.0                
#>  [53] lattice_0.22-6              tibble_3.2.1               
#>  [55] plyr_1.8.9                  withr_3.0.2                
#>  [57] evaluate_1.0.1              xml2_1.3.6                 
#>  [59] circlize_0.4.16             pillar_1.9.0               
#>  [61] affyio_1.77.0               BiocManager_1.30.25        
#>  [63] DT_0.33                     foreach_1.5.2              
#>  [65] MSnbase_2.33.0              MALDIquant_1.22.3          
#>  [67] ncdf4_1.23                  generics_0.1.3             
#>  [69] ggplot2_3.5.1               munsell_0.5.1              
#>  [71] scales_1.3.0                glue_1.8.0                 
#>  [73] lazyeval_0.2.2              tools_4.5.0                
#>  [75] data.table_1.16.2           mzID_1.45.0                
#>  [77] QFeatures_1.17.0            vsn_3.75.0                 
#>  [79] mzR_2.41.0                  XML_3.99-0.17              
#>  [81] Cairo_1.6-2                 grid_4.5.0                 
#>  [83] impute_1.81.0               tidyr_1.3.1                
#>  [85] crosstalk_1.2.1             MsCoreUtils_1.19.0         
#>  [87] colorspace_2.1-1            patchwork_1.3.0            
#>  [89] GenomeInfoDbData_1.2.13     PSMatch_1.11.0             
#>  [91] cli_3.6.3                   fansi_1.0.6                
#>  [93] S4Arrays_1.7.0              ComplexHeatmap_2.23.0      
#>  [95] dplyr_1.1.4                 AnnotationFilter_1.31.0    
#>  [97] pcaMethods_1.99.0           gtable_0.3.6               
#>  [99] sass_0.4.9                  digest_0.6.37              
#> [101] SparseArray_1.7.0           htmlwidgets_1.6.4          
#> [103] rjson_0.2.23                farver_2.1.2               
#> [105] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [107] httr_1.4.7                  NormalyzerDE_1.25.0        
#> [109] GlobalOptions_0.1.2         statmod_1.5.0              
#> [111] gridtext_0.1.5              MASS_7.3-61

References

Biadglegne, Fantahun, Johannes R. Schmidt, Kathrin M. Engel, Jörg Lehmann, Robert T. Lehmann, Anja Reinert, Brigitte König, Jürgen Schiller, Stefan Kalkhof, and Ulrich Sack. 2022. “Mycobacterium Tuberculosis Affects Protein and Lipid Content of Circulating Exosomes in Infected Patients Depending on Tuberculosis Disease State.” Biomedicines 10 (4): 783. https://doi.org/10.3390/biomedicines10040783.
Vehmas, Anni P., Marion Adam, Teemu D. Laajala, Gabi Kastenmüller, Cornelia Prehn, Jan Rozman, Claes Ohlsson, et al. 2016. “Liver Lipid Metabolism Is Altered by Increased Circulating Estrogen to Androgen Ratio in Male Mouse.” Journal of Proteomics 133 (February): 66–75. https://doi.org/10.1016/j.jprot.2015.12.009.