phenomis 1.0.2
reading
: reading the datainspecting
: looking at the datahypotesting
: univariate hypothesis testingannotating
: MS annotationwriting
: Exporting the resultsAnalysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. detecting trends and outliers) and selecting features of predictive value (biomarker discovery). It comprises multiple steps including:
Post-processing (quality control, normalization and/or transformation of intensities)
Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)
Annotation (physico-chemical properties, biological pathways)
The phenomis
package focuses on the two first steps, and can be
combined with other packages for multivariate modeling, feature
selection and annotation, such as the
ropls
and
biosigner
and
biodb
Bioconductor
packages. As an example, the tutorial below reproduces the whole
workflow described in the sacurine
study (Thévenot et al. 2015).
The phenomis
package has also been used to post-processed the
preclinical, proteomics and metabolomics data from the ProMetIS study
(Imbert et al. 2021). Proteomics and metabolomics data
from this study are provided as a second dataset, named prometis
.
Examples of statistical analyses of this dataset are provided in the
“Working with multi-omics datasets” section.
SummarizedExperiment
and MultiAssayExperiment
formatsThe standard SummarizedExperiment
and MultiAssayExperiment
Bioconductor formats for single and multi-omics datasets are used by the
phenomis
methods. The methods return updated SummarizedExperiment
and MultiAssayExperiment
objects with either modified assay matrices
(e.g. when using the transforming
method) or enriched rowData
and
colData
with new columns (e.g. fold changes and p-values when using
the hypotesting
method).
Note that the phenomis
package also supports the ExpressionSet
(respectively, MultiDataSet
) formats, which are previous
(respectively, alternative) formats for the management of single
(respectively, multiple) datasets.
Input (i.e. preprocessed) data consists of a ‘variable times sample’ matrix of intensities (dataMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Importantly, the row names from the dataMatrix must be identical to the row names from the vaiableMetadata (feature names), and the column names from the dataMatrix must be identical to the row names from sampleMetadata (sample names).
Note that 3 sample metadata columns should be specified when working
with the correcting
method, namely:
sampleType (character): the following types are handled by the algorithms:
sample: biological sample of interest
blank: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry
pool: quality control sample generated by pooling an equal volume of each of the sample of interest from the whole study (i.e. from all batches)
poolN: (where N is an integer; e.g. pool2, pool4, …): diluted quality control sample: N indicates the dilution factor
injectionOrder (integer): order of the sample injection (e.g. in the LC-MS instrument)
batch (character): name of each batch
Text and graphics can be handled with the phenomis methods by setting the two arguments:
report.c
: if set to ‘interactive’ [default], messages are
displayed interactively; if set to a character corresponding the a
filename (with the ‘.txt’ extension), messages are diverted to this
file by using the sink function internally (the same file can be
appended by successive methods); if set to ‘none’, messages are
suppressed
figure.c
: if set to ‘interactive’ [default], graphics are
displayed interactively; if set to a character corresponding the a
filename (with the ‘.pdf’ extension), a pdf file with the plot is
generated instead; if set to ‘none’, graphics are suppressed
As an example, we will use the phenomis
package to study the
sacurine human cohort. The study is aimed at characterizing the
physiological variations of the human urine metabolome with age, body
mass index (BMI), and gender (Thévenot et al. 2015). Urine samples
from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh
performance liquid chromatography (UPLC) coupled to high-resolution mass
spectrometry (LTQ-Orbitrap). Raw data are publicly available on the
MetaboLights repository
(MTBLS404).
This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):
correcting
: Correction of the signal drift by local regression
(loess) modeling of the intensity trend in pool samples
(Dunn et al. 2011); Adjustment of offset differences between the
two analytical batches by using the average of the pool intensities
in each batch (Kloet et al. 2009)
Variable quality control by discarding features with a coefficient of variation above 30% in pool samples
Normalization by the sample osmolality
transforming
: log10 transformation
inspecting
: Computing metrics to filter out outlier samples
according to the Weighted Hotellings’T2 distance
(Tenenhaus 1999), the Z-score of one of the intensity
distribution deciles (Alonso et al. 2011), and the Z-score of the
number of missing values (Alonso et al. 2011). A 0.001 threshold
for all p-values results in the HU_096 sample being discarded
hypotesting
: Univariate hypothesis testing of significant
variations with age, BMI, or between genders (Student’s T test with
Benjamini Hochberg correction)
PCA exploration of the data set;
ropls
Bioconductor
package (Thévenot et al. 2015)
clustering
: Hierarchical clustering
OPLS(-DA) modeling of age, BMI and gender;
ropls
Bioconductor
package (Thévenot et al. 2015)
Selection of the features which significantly contributes to the
discrimination between gender with PLS-DA, Random Forest, or Support
Vector Machines classifiers;
biosigner
Bioconductor package (Rinaudo et al. 2016)
A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the workflow4metabolomics.usegalaxy.fr online infrastructure (Guitton et al. 2017).
reading
: reading the dataThe reading
function reads the data sets and builds the
SummarizedExperiment
object. Additional information on how to build
and handle SummarizedExperiment
objects (as well as
MultiAssayExperiment
, ExpressionSet
, and MultiDataSet
) are
provided in the Appendix.
library(phenomis)
sacurine.se <- reading(system.file("extdata/sacurine", package = "phenomis"))
## class: SummarizedExperiment
## dim: 113 210
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(113): (2-methoxyethoxy)propanoic acid isomer
## (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(10): database_identifier chemical_formula ...
## retention_time reliability
## colnames(210): QC1_001 HU_neg_017 ... HU_neg_146_b2 QC1_012_b2
## colData names(10): subset full ... bmi gender
inspecting
: looking at the dataThe inspecting
method provides a numerical and graphical overview of
the data. Furthermore, it computes quality metrics which may
subsequently be used to filter out some samples or variables.
Graphical overview. The data matrix is visualized with a color
gradient (top right) and the score plot of the Principal Component
Analysis is shown for the two first components (bottom right). The
black ellipse corresponds to the area of 95% of the samples, as
computed with the Hotelling test. For each sample the mean of
variable intensities is shown as a function of the injection order
to detect any signal drift and/or batch correction (bottom left).
Note that for this coloring to be displayed, the sampleType,
injectionOrder and batch columns should be provided in the
colData
of (each of) the dataset(s). Finally, some metrics are
computed regarding the proportion of NAs, 0 values, intensity
quantiles, and proportion of features with a coefficient of
variation in pool intensities < 30%.
Quality metrics (as additional column in the metadata):
samples (added in the colData
):
hotel_pval: p-value from the Hotelling’s T2 test in the first plane of the PC components
miss_pval: p-value associated to the z-score of the proportion of missing values (Alonso et al. 2011)
deci_pval: p-value associated to the z-score of intensity deciles (Alonso et al. 2011)
For each test, low p-values highlight samples with extreme behavior.
features (included in the rowData
)
colData
from the
individual datasets), variable metrics are computed: sample,
pool, and blank mean, sd and coefficient of variation (if
the corresponding types are present in the ‘sampleType’
column), as well as ‘blank’ mean / ‘sample’ mean, and ‘pool’
CV / ‘sample’ CV ratiosacurine.se <- inspecting(sacurine.se, report.c = "none")
The filtering
method may be used to discard samples and/or features
with a high proportion of missing values and/or a variance of 0. When
filtering the features, sample class may be specified (as the name a
column from the sample metadata): in this case, features will be kept
when:
the proportion of missing values is below the threshold in at least one class
the variance is above 0 in both classes
correcting
: Correcting signal drift and batch effectFor each feature, the signal drift and the batch effect are corrected by
using a normalization strategy which relies on the measurements of a
pooled (or QC) sample injected periodically: for each variable, a
regression model is fitted to the values of the pool and subsequently
used to adjust the intensities of the samples of interest
(Kloet et al. 2009; Dunn et al. 2011). The sample
metadata of each datasets (e.g. colData
Data Frames) must contain 3
columns:
sampleType (character): either ‘sample’, ‘blank’, or ‘pool’
injectionOrder (integer): order of injection in the instrument
batch (character): batch name
sacurine.se <- correcting(sacurine.se,
reference.vc = "pool",
report.c = "none")
## Correction method: loess
## Reference observations: pool
## Processing batch:
## ne1
## ne2
inspecting
to compute the ‘pool_CV’ metricsacurine.se <- inspecting(sacurine.se,
figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[rowData(sacurine.se)[, "pool_CV"] <= 0.3, ]
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "sampleType"] != "pool"]
print(sacurine.se)
## class: SummarizedExperiment
## dim: 110 184
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(110): (2-methoxyethoxy)propanoic acid isomer
## (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(14): database_identifier chemical_formula ... poolDil_cor
## poolDil_pval
## colnames(184): HU_neg_017 HU_neg_018 ... HU_neg_106_b2 HU_neg_146_b2
## colData names(15): subset full ... miss_pval deci_pval
The normalization of each sample by its ‘osmolality’ value does not
require any extra method and relies on the classical methods from the
SummarizedExperiment
package.
assay(sacurine.se) <- sweep(assay(sacurine.se),
2,
colData(sacurine.se)[, "osmolality"],
"/")
Note that a normalizing
method is available in the phenomis
package
to perform Probabilistic Quotient Normalization
(Dieterle et al. 2006).
transforming
: transforming the data intensitiessacurine.se <- transforming(sacurine.se, method.vc = "log10")
## 'log10' transformation
sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 &
colData(sacurine.se)[, "miss_pval"] >= 0.001 &
colData(sacurine.se)[, "deci_pval"] >= 0.001]
Final visual check of the data before performing the statistics:
inspecting(sacurine.se, report.c = "none")
hypotesting
: univariate hypothesis testingThe hypotesting
method is a wrapper of the main R functions for
hypothesis testing and corrections for multiple testing. The list of
available tests include two sample tests (t-test and Wilcoxon rank test,
but also the limma test), analysis of variance (for one and two factors)
and Kruskal-Wallis rank test, and correlation tests (by using either the
pearson or the spearman correlation).
sacurine.se <- hypotesting(sacurine.se,
test.c = "ttest",
factor_names.vc = "gender",
adjust.c = "BH",
adjust_thresh.n = 0.05,
report.c = "none")
The phenomis
package can be conveniently complemented with other
packages for comprehensive statistical analysis and annotation. In the
rest of this tutorial, we illustrate how the
ropls
,
biosigner
and
biodb
Bioconductor
packages can be included to perform multivariate modeling
(Thévenot et al. 2015), feature selection (Rinaudo et al. 2016),
and feature annotation (Roger 2022). All these packages support the
SummarizedExperiment
and MultiAssayExperiment
(but also
ExpressionSet
and MultiDataSet
), to provide interoperability.
We first focus on unsupervised methods for multivariate analysis, by performing Principal Component Analysis and hierarchical clustering.
PCA can be performed by using the opls
function from the ropls
package. It returns an updated object with scores and loadings included
as new columns in the metadata. The PCA model is stored in the metadata
from the SummarizedExperiment
and can be accessed with the getOpls
method:
sacurine.se <- ropls::opls(sacurine.se, info.txt = "none")
sacurine.pca <- ropls::getOpls(sacurine.se)[["PCA"]]
ropls::plot(sacurine.pca,
parAsColFcVn = colData(sacurine.se)[, "gender"],
typeVc = "x-score")
ropls::plot(sacurine.pca,
parAsColFcVn = colData(sacurine.se)[, "age"],
typeVc = "x-score")
clustering
: hierarchical clusteringThe clustering
method from the phenomis
package performs the
hierarchical clustering of samples and variables. The default
dissimilarity is $1-cor_{pearson}$.
sacurine.se <- clustering(sacurine.se,
clusters.vi = c(5, 3))
We now address the multivariate modeling of the response of interest (either age, bmi, or gender), by using Partial Least Squares modeling and feature selection.
Partial Least-Squares (PLS), which is a latent variable regression method based on covariance between the predictors and the response, has been shown to efficiently handle datasets with multi-collinear predictors, as in the case of spectrometry measurements (Wold, Sjostrom, and Eriksson 2001).
The opls
function from the ropls
package can be used to perform
Partial Least Squares modeling, by specifying the response to be modeled
as the second argument:
sacurine.se <- ropls::opls(sacurine.se, "gender")
## PLS-DA
## 183 samples x 110 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.273 0.73 0.57 0.261 3 0 0.05 0.05
The (O)PLS(-DA) can be obtained from the metadata, e.g. to perform specific plots:
sacurine_gender.plsda <- ropls::getOpls(sacurine.se)[["gender_PLSDA"]]
ropls::plot(sacurine_gender.plsda,
parAsColFcVn = colData(sacurine.se)[, "gender"],
typeVc = "x-score")
The biosigner
approach is a wrapper method to select the features that
significantly contribute to the performance of a binary classifier,
namely PLS-DA, random forest, or Support Vector Machine
(Rinaudo et al. 2016). We refer the interested reader to the
vignette from the biosigner
package.
sacurine.biosign <- biosigner::biosign(sacurine.se, "gender", bootI = 5)
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
## plsda randomforest svm
## p-Anisic acid "S" "S" "S"
## Testosterone glucuronide "S" "S" "S"
## Oxoglutaric acid "A" "S" "S"
## Pantothenic acid "A" "E" "S"
## Acetylphenylalanine "B" "E" "S"
## Malic acid "S" "E" "C"
## N2-Acetylaminoadipic acid "C" "E" "S"
## 2-Isopropylmalic acid "E" "E" "S"
## Methoxysalicylic acid isomer "E" "E" "S"
## N4-Acetylcytidine "E" "E" "S"
## Pyrroledicarboxylic acid "E" "E" "S"
## Taurine "E" "E" "S"
## Accuracy:
## plsda randomforest svm
## Full 0.867 0.826 0.888
## AS 0.898 0.848 0.910
## S 0.885 0.848 0.941
Please note that the bootI
parameter is set to 5 to speed up the
vignette building, but that we advice to keep the default 50 number of
bootstraps to obtain more stable signatures.
annotating
: MS annotationThis method is based on the biodb package (and its companion biodbChebi), which provides access to standard remote chemical and biological databases (ChEBI, KEGG, HMDB, …), as well as to in-house local database files (CSV, SQLite), with easy retrieval of entries, access to web services, search of compounds by mass and/or name (Roger 2022).
Viewing the parameters from the annotating
method and their
default values:
annotating_parameters()
## Annotating parameters for the query of chebi:
## mode available default
## query.type character c('mz', 'chebi.id', 'kegg.id') mz
## query.col character mz
## ms.mode character c('pos', 'neg') pos
## mz.tol numeric 10
## mz.tol.unit character c('plain', 'ppm') ppm
## fields character see below see below
## fieldsLimit numeric 1
## max.results numeric 3
## prefix character chebi.
## sep character |
## 'fields' possible values:
## 'accession', 'charge', 'chebi.id', 'formula', 'inchi', 'inchikey', 'kegg.compound.id', 'molecular.mass', 'monoisotopic.mass', 'name', 'smiles'
## 'fields' default values:
## 'chebi.id', 'name', 'formula', 'monoisotopic.mass', 'kegg.compound.id'
Querying chemical annotation from the Chemical Entities of Biological Interest ChEBI database:
sacurine.se <- annotating(sacurine.se,
database.c = "chebi",
param.ls = list(query.type = "mz",
query.col = "mass_to_charge",
ms.mode = "neg",
mz.tol = 10,
mz.tol.unit = "ppm",
max.results = 3,
prefix = "chebiMZ."),
report.c = "none")
knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ",
colnames(rowData(sacurine.se)))]))
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
param.ls = list(query.type = "chebi.id",
query.col = "database_identifier",
prefix = "chebiID."))
knitr::kable(head(rowData(sacurine.se)[, grep("chebiID",
colnames(rowData(sacurine.se)))]))
Adding chemical information from a local database:
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"),
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
sacurine.se <- annotating(sacurine.se,
database.c = "local.ms",
param.ls = list(query.type = "mz",
query.col = "mass_to_charge",
ms.mode = "neg",
mz.tol = 5,
mz.tol.unit = "ppm",
local.ms.db = msdbDF,
prefix = "localMS."),
report.c = "none")
## INFO [17:36:43.973] Loading definitions from package biodb version 1.6.0.
## INFO [17:36:47.495] Loading definitions from package biodbChebi version 1.4.0.
## INFO [17:36:47.613] Loading definitions from package biodbExpasy version 1.2.0.
## INFO [17:36:47.646] Loading definitions from package biodbHmdb version 1.4.0.
## INFO [17:36:47.790] Loading definitions from package biodbKegg version 1.4.0.
## INFO [17:36:48.009] Loading definitions from package biodbLipidmaps version 1.4.0.
## INFO [17:36:48.063] Loading definitions from package biodbMirbase version 1.2.0.
## INFO [17:36:48.094] Loading definitions from package biodbNcbi version 1.2.0.
## INFO [17:36:48.184] Loading definitions from package biodbNci version 1.2.0.
## INFO [17:36:48.215] Loading definitions from package biodbUniprot version 1.4.0.
## INFO [17:36:53.932] Closing BiodbMain instance...
## INFO [17:36:53.935] Connector "mass.csv.file" deleted.
knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]), grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
localMS.accession | localMS.formula | localMS.mass.csv.file.id | localMS.molecular.mass | localMS.ms.mode | localMS.name | localMS.peak.mz | localMS.peak.mztheo | |
---|---|---|---|---|---|---|---|---|
Citric acid | KNA00550 | C6H8O7 | KNA00550 | 192.027 | neg | Citric acid | 191.019252 | 191.019252 |
Malic acid | KNA00527 | C4H6O5 | KNA00527 | 134.02152 | neg | L-Malic acid | 133.014168 | 133.014168 |
Taurine | KNA00479 | C2H7NO3S | KNA00479 | 125.01466 | neg | Taurine | 124.00727 | 124.00727 |
Tryptophan | KNA00507 | C11H12N2O2 | KNA00507 | 204.08988 | neg | L-Tryptophan | 203.082131 | 203.082131 |
writing
: Exporting the resultsFinally, the writing
method can be used to export the
SummarizedExperiment
object in the 3 tabular file format. In case of a
MultiAssayExperiment
, one subfolder containing the 3 files will be
created for each dataset.
writing(sacurine.se, dir.c = getwd(), prefix.c = "sacurine")
The main features from the phenomis
package can also be accessed via a
graphical user interface in the Quality Metrics, Batch
Correction, Univariate, and Heatmap modules from the
workflow4metabolomics.usegalaxy.fr
online resource for computational metabolomics, based on the Galaxy
environment.
The MultiAssayExperiment
format is useful to handle multi-omics
datasets (Ramos et al. 2017). The phenomis
package (as well as the ropls
and biosigner
packages) can be used to
process the data sets in parallel, such as all methods described before
can be directly applied to the MultiAssayExperiment
object.
In several methods, specific parameters for each dataset can be passed
as a vector: for instance, with two datasets, if one wishes to log2
transform the intensities of the first dataset but not the second one,
the method.vc = c("log2", "none")
parameter can be used with the
transforming
method (in contrast, if only one value is provided, it
will be used for all datasets).
Here we apply phenomis
statistical methods to the ProMetIS study ,
which addresses the molecular phenotyping of knock-out mouse models by
combined proteomics and metabolomics analysis
(Imbert et al. 2021). The phenomis
dataset has been
selected from the original data (publicly available on GitHub:
IFB-ElixirFr/ProMetIS) as
follows:
liver data
WT and KO mice for the LAT gene (Linker for Activation of T cells) only
the proteomics dataset and the metabolomics dataset obtained with the ZIC-pHILIC chromatographic column
post-processed data only
100 features only
The multi-omics dataset is saved as text files and can be loaded with
the reading
function as a MultiAssayExperiment
(or as a
MultiDataSet
with the output.c = "set"
argument)
prometis.mae <- reading(system.file("extdata/prometis",
package = "phenomis"))
## Reading the 'metabo' dataset...
## Reading the 'proteo' dataset...
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] metabo: SummarizedExperiment with 100 rows and 28 columns
## [2] proteo: SummarizedExperiment with 100 rows and 28 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Let us have a look at the (common) sample metadata:
head(colData(prometis.mae))
## DataFrame with 6 rows and 3 columns
## gene mouse_id sex
## <character> <integer> <character>
## L818f LAT 818 F
## L819f LAT 819 F
## L824m LAT 824 M
## L825m LAT 825 M
## L826m LAT 826 M
## L827m LAT 827 M
We first explore the datasets:
prometis.mae <- inspecting(prometis.mae, report.c = "none")
We then perform univariate hypothesis testing with the limma
approach
(Ritchie et al. 2015):
prometis.mae <- hypotesting(prometis.mae, "limma", "gene",
report.c = "none")
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 28 sampleMap rows not in names(experiments)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 28 sampleMap rows not in names(experiments)
As with single omics studies, the ropls
and biosigner
can be
directly applied to MultiAssayExperiment
for multivariate analysis:
prometis.mae <- ropls::opls(prometis.mae, "gene")
##
##
## Building the model for the 'metabo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.331 0.941 0.859 0.128 2 0 0.05 0.05
##
##
## Building the model for the 'proteo' dataset:
## PLS-DA
## 28 samples x 100 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.137 0.677 0.458 0.294 1 0 0.1 0.05
## Warning: Single component model: only 'overview' and 'permutation' (in case of
## single response (O)PLS(-DA)) plots available
and feature selection:
prometis.mae <- biosigner::biosign(prometis.mae, "gene", bootI = 5)
##
##
## Selecting the features for the 'metabo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## Significant features from 'S' groups:
## plsda randomforest svm
## ADP "B" "S" "S"
## GDP-L-fucose "S" "E" "S"
## 9H-xanthine "B" "S" "E"
## deoxycholic acid "E" "E" "S"
## taurocholic acid "E" "E" "S"
## 3-phenylpropionic acid "E" "E" "S"
## 4-oxo-4-(pyridin-3-yl)butanoic acid "E" "E" "S"
## ferulic acid "E" "E" "S"
## glycocholate "E" "E" "S"
## UDP-alpha-D-galactose "E" "E" "S"
## Accuracy:
## plsda randomforest svm
## Full 0.980 0.949 0.963
## AS 0.912 0.918 0.952
## S 0.830 0.933 0.980
##
##
## Selecting the features for the 'proteo' dataset:
## Selecting features for the plsda model
## Selecting features for the randomforest model
## Selecting features for the svm model
## No significant variable found for the selected classifier(s): 'plsda', 'randomforest', 'svm'
The final MultiAssayExperiment
can be exported as text files with:
writing(prometis.mae, dir.c = getwd(), prefix.c = "prometis")
The CLL
package contains the chronic lymphocytic leukemia (CLL) gene
expression data from 24 samples, that were either classified as
progressive or stable in regards to disease progression.
We first load the data (which are stored as an ExpressionSet
object),
and restrict to the first 1,000 features (to speed up computations):
data(sCLLex, package = "CLL")
cll.eset <- sCLLex[seq_len(1000), ]
We explore the data:
cll.eset <- inspecting(cll.eset, report.c = "none")
Investigate whether clusters can be observed (we first add the letter ‘p’ or ‘s’ in the sample names to indicate the phenotype: p = progressive, s = stable):
Biobase::sampleNames(cll.eset) <- paste0(Biobase::sampleNames(cll.eset),
"_",
substr(Biobase::pData(cll.eset)[, "Disease"], 1, 1))
cll.eset <- clustering(cll.eset)
and perform hypothesis testing with the limma test (the dots must be removed from the ‘progress.’ phenotype):
Biobase::pData(cll.eset)[, "Disease"] <- gsub(".", "",
Biobase::pData(cll.eset)[, "Disease"],
fixed = TRUE)
cll.eset <- hypotesting(cll.eset, "limma", "Disease")
All the subsequent multivariate analysis and feature selection steps
described with the sacurine
data set can be performed with the
cll.eset
accordingly.
We provide an example based on the NCI60_4arrays
cancer data set from
the omicade4
package (Meng et al. 2014), which
has been made available in the MultiAssayExperiment
format in the
ropls
package. The data consist of gene transcript expression analysis
from NCI-60 cell lines by using 4 microarray platforms
(Reinhold et al. 2012).
Getting the NCI60
data set as a MultiAssayExperiment
:
data(NCI60, package = "ropls")
nci.mae <- NCI60[["mae"]]
Let’s focus on the LEukemia and MElanoma cancer types, and the agilent and hgu95 arrays:
library(MultiAssayExperiment)
table(nci.mae$cancer)
##
## BR CN CO LC LE ME OV PR RE
## 5 6 7 9 6 10 7 2 8
nci.mae <- nci.mae[,
nci.mae$cancer %in% c("LE", "ME"),
c("agilent", "hgu95")]
We can now e.g. perform clustering on each data set independently:
nci.mae <- clustering(nci.mae)
and/or perform hypothesis testing on each data set:
nci.mae <- hypotesting(nci.mae, "limma", "cancer", report.c = "none")
as well as all the analyses described above for the prometis
data set.
SummarizedExperiment
The SummarizedExperiment
format has been designed by the Bioconductor
team to handle (single) omics datasets (Morgan et al. 2022).
An example of SummarizedExperiment
creation and a summary of useful
methods are provided below.
Please refer to package
vignettes
for a full description of SummarizedExperiment
objects .
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the SummarizedExperiment (package SummarizedExperiment)
library(SummarizedExperiment)
sac.se <- SummarizedExperiment(assays = list(sacurine = t(data.mn)),
colData = samp.df,
rowData = feat.df)
# note that colData and rowData main format is DataFrame, but data frames are accepted when building the object
stopifnot(validObject(sac.se))
# Viewing the SummarizedExperiment
# ropls::view(sac.se)
Methods | Description | Returned class |
---|---|---|
Constructors | ||
SummarizedExperiment |
Create a SummarizedExperiment object | SummarizedExperiment |
makeSummarizedExperimentFromExpressionSet |
SummarizedExperiment |
|
Accessors | ||
assayNames |
Get or set the names of assay() elements | character |
assay |
Get or set the ith (default first) assay element | matrix |
assays |
Get the list of experimental data numeric matrices | SimpleList |
rowData |
Get or set the row data (features) | DataFrame |
colData |
Get or set the column data (samples) | DataFrame |
metadata |
Get or set the experiment data | list |
dim |
Get the dimensions (features of interest x samples) | integer |
dimnames |
Get or set the dimension names | list of length 2 character/NULL |
rownames |
Get the feature names | character |
colnames |
Get the sample names | character |
Conversion | ||
as(, "SummarizedExperiment") |
Convert an ExpressionSet to a SummarizedExperiment | SummarizedExperiment |
MultiAssayExperiment
The MultiAssayExperiment
format has been designed by the Bioconductor
team to handle multiomics datasets
(Ramos et al. 2017).
An example of MultiAssayExperiment
creation and a summary of useful
methods are provided below. Please refer to package
vignettes or
to the original publication for a full description of
MultiAssayExperiment
objects
(Ramos et al. 2017).
Loading the NCI60_4arrays
from the omicade4
package:
data("NCI60_4arrays", package = "omicade4")
Creating the MultiAssayExperiment
:
library(MultiAssayExperiment)
# Building the individual SummarizedExperiment instances
experiment.ls <- list()
sampleMap.ls <- list()
for (set.c in names(NCI60_4arrays)) {
# Getting the data and metadata
assay.mn <- as.matrix(NCI60_4arrays[[set.c]])
coldata.df <- data.frame(row.names = colnames(assay.mn),
.id = colnames(assay.mn),
stringsAsFactors = FALSE) # the 'cancer' information will be stored in the main colData of the mae, not the individual SummarizedExperiments
rowdata.df <- data.frame(row.names = rownames(assay.mn),
name = rownames(assay.mn),
stringsAsFactors = FALSE)
# Building the SummarizedExperiment
assay.ls <- list(se = assay.mn)
names(assay.ls) <- set.c
se <- SummarizedExperiment(assays = assay.ls,
colData = coldata.df,
rowData = rowdata.df)
experiment.ls[[set.c]] <- se
sampleMap.ls[[set.c]] <- data.frame(primary = colnames(se),
colname = colnames(se)) # both datasets use identical sample names
}
sampleMap <- listToMap(sampleMap.ls)
# The sample metadata are stored in the colData data frame (both datasets have the same samples)
stopifnot(identical(colnames(NCI60_4arrays[[1]]),
colnames(NCI60_4arrays[[2]])))
sample_names.vc <- colnames(NCI60_4arrays[[1]])
colData.df <- data.frame(row.names = sample_names.vc,
cancer = substr(sample_names.vc, 1, 2))
nci.mae <- MultiAssayExperiment(experiments = experiment.ls,
colData = colData.df,
sampleMap = sampleMap)
stopifnot(validObject(nci.mae))
Methods | Description | Returned class |
---|---|---|
Constructors | ||
MultiAssayExperiment |
Create a MultiAssayExperiment object | MultiAssayExperiment |
ExperimentList |
Create an ExperimentList from a List or list | ExperimentList |
Accessors | ||
colData |
Get or set data that describe the patients/biological units | DataFrame |
experiments |
Get or set the list of experimental data objects as original classes | experimentList |
assays |
Get the list of experimental data numeric matrices | SimpleList |
assay |
Get the first experimental data numeric matrix | matrix , matrix-like |
sampleMap |
Get or set the map relating observations to subjects | DataFrame |
metadata |
Get or set additional data descriptions | list |
rownames |
Get row names for all experiments | CharacterList |
colnames |
Get column names for all experiments | CharacterList |
Subsetting | ||
mae[i, j, k] |
Get rows, columns, and/or experiments | MultiAssayExperiment |
mae[i, ,] |
i: GRanges, character, integer, logical, List, list | MultiAssayExperiment |
mae[,j,] |
j: character, integer, logical, List, list | MultiAssayExperiment |
mae[,,k] |
k: character, integer, logical | MultiAssayExperiment |
mae[[i]] |
Get or set object of arbitrary class from experiments | (Varies) |
mae[[i]] |
Character, integer, logical | |
mae$column |
Get or set colData column | vector (varies) |
Management | ||
complete.cases |
Identify subjects with complete data in all experiments | vector (logical) |
duplicated |
Identify subjects with replicate observations per experiment | list of LogicalLists |
mergeReplicates |
Merge replicate observations within each experiment | MultiAssayExperiment |
intersectRows |
Return features that are present for all experiments | MultiAssayExperiment |
intersectColumns |
Return subjects with data available for all experiments | MultiAssayExperiment |
prepMultiAssay |
Troubleshoot common problems when constructing main class | list |
Reshaping | ||
longFormat |
Return a long and tidy DataFrame with optional colData columns | DataFrame |
wideFormat |
Create a wide DataFrame, one row per subject | DataFrame |
Combining | ||
c |
Concatenate an experiment | MultiAssayExperiment |
Viewing | ||
upsetSamples |
Generalized Venn Diagram analog for sample membership | upset |
Exporting | ||
exportClass |
Exporting to flat files | csv or tsv files |
ExpressionSet
The ExpressionSet
format (Biobase
package) was designed by the
Bioconductor team as the first format to handle (single) omics data. It
has now been supplanted by the SummarizedExperiment
format.
An example of ExpressionSet
creation and a summary of useful methods
are provided below. Please refer to ‘An introduction to Biobase and
ExpressionSets’
from the documentation from the
Biobase
package for a
full description of ExpressionSet
objects.
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the ExpressionSet (package Biobase)
sac.set <- Biobase::ExpressionSet(assayData = t(data.mn))
Biobase::pData(sac.set) <- samp.df
Biobase::fData(sac.set) <- feat.df
stopifnot(validObject(sac.set))
# Viewing the ExpressionSet
# ropls::view(sac.set)
Methods | Description | Returned class |
---|---|---|
exprs |
‘variable times samples’ numeric matrix | matrix |
pData |
sample metadata | data.frame |
fData |
variable metadata | data.frame |
sampleNames |
sample names | character |
featureNames |
variable names | character |
dims |
2x1 matrix of ‘Features’ and ‘Samples’ dimensions | matrix |
varLabels |
Column names of the sample metadata data frame | character |
fvarLabels |
Column names of the variable metadata data frame | character |
MultiDataSet
Loading the NCI60_4arrays
from the omicade4
package:
data("NCI60_4arrays", package = "omicade4")
Creating the MultiDataSet
:
library(MultiDataSet)
# Creating the MultiDataSet instance
nci.mds <- MultiDataSet::createMultiDataSet()
# Adding the two datasets as ExpressionSet instances
for (set.c in names(NCI60_4arrays)) {
# Getting the data
expr.mn <- as.matrix(NCI60_4arrays[[set.c]])
pdata.df <- data.frame(row.names = colnames(expr.mn),
cancer = substr(colnames(expr.mn), 1, 2),
stringsAsFactors = FALSE)
fdata.df <- data.frame(row.names = rownames(expr.mn),
name = rownames(expr.mn),
stringsAsFactors = FALSE)
# Building the ExpressionSet
eset <- Biobase::ExpressionSet(assayData = expr.mn,
phenoData = new("AnnotatedDataFrame",
data = pdata.df),
featureData = new("AnnotatedDataFrame",
data = fdata.df),
experimentData = new("MIAME",
title = set.c))
# Adding to the MultiDataSet
nci.mds <- MultiDataSet::add_eset(nci.mds, eset, dataset.type = set.c,
GRanges = NA, warnings = FALSE)
}
stopifnot(validObject(nci.mds))
Methods | Description | Returned class |
---|---|---|
Constructors | ||
createMultiDataSet |
Create a MultiDataSet object | MultiDataSet |
add_eset |
Create a MultiAssayExperiment object | MultiDataSet |
Subsetting | ||
mset[i, ] |
i: character,logical (samples to select) | MultiDataSet |
mset[, k] |
k: character (names of datasets to select) | MultiDataSet |
mset[[k]] |
k: character (name of the datast to select) | ExpressionSet |
Accessors | ||
as.list |
Get the list of data matrices | list |
pData |
Get the list of sample metadata | list |
fData |
Get the list of variable metadata | list |
sampleNames |
Get the list of sample names | list |
Management | ||
commonSamples |
Select samples that are present in all datasets | MultiDataSet |
Conversion | ||
mds2mae |
Convert a MultiDataSet to a MultiAssayExperiment | MultiAssayExperiment |
Here is the output of sessionInfo()
on the system on which this
document was compiled:
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MultiDataSet_1.26.0 MultiAssayExperiment_1.24.0
## [3] biodb_1.6.0 phenomis_1.0.2
## [5] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [7] GenomicRanges_1.50.1 GenomeInfoDb_1.34.2
## [9] IRanges_2.32.0 S4Vectors_0.36.0
## [11] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
## [13] matrixStats_0.62.0 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 class_7.3-20 ellipsis_0.3.2
## [4] qqman_0.1.8 futile.logger_1.4.3 XVector_0.38.0
## [7] proxy_0.4-27 farver_2.1.1 ggrepel_0.9.1
## [10] bit64_4.0.5 fansi_1.0.3 mvtnorm_1.1-3
## [13] ranger_0.14.1 cachem_1.0.6 knitr_1.40
## [16] SuppDists_1.1-9.7 jsonlite_1.8.3 biodbChebi_1.4.0
## [19] Rmpfr_0.8-9 dbplyr_2.2.1 BiocManager_1.30.19
## [22] compiler_4.2.1 httr_1.4.4 PMCMRplus_1.9.6
## [25] assertthat_0.2.1 Matrix_1.5-1 fastmap_1.1.0
## [28] lazyeval_0.2.2 limma_3.54.0 cli_3.4.1
## [31] formatR_1.12 htmltools_0.5.3 prettyunits_1.1.1
## [34] tools_4.2.1 gmp_0.6-7 igraph_1.3.5
## [37] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.9
## [40] dplyr_1.0.10 rappdirs_0.3.3 Rcpp_1.0.9
## [43] jquerylib_0.1.4 vctrs_0.5.0 xfun_0.34
## [46] stringr_1.4.1 lifecycle_1.0.3 XML_3.99-0.12
## [49] zlibbioc_1.44.0 MASS_7.3-58.1 scales_1.2.1
## [52] lgr_0.4.4 hms_1.1.2 lambda.r_1.2.4
## [55] RColorBrewer_1.1-3 ropls_1.30.0 yaml_2.3.6
## [58] curl_4.3.3 memoise_2.0.1 ggplot2_3.4.0
## [61] sass_0.4.2 calibrate_1.7.7 stringi_1.7.8
## [64] RSQLite_2.2.18 biosigner_1.26.0 highr_0.9
## [67] randomForest_4.7-1.1 e1071_1.7-12 filelock_1.0.2
## [70] rlang_1.0.6 pkgconfig_2.0.3 bitops_1.0-7
## [73] evaluate_0.17 lattice_0.20-45 purrr_0.3.5
## [76] labeling_0.4.2 chk_0.8.1 htmlwidgets_1.5.4
## [79] bit_4.0.4 tidyselect_1.2.0 plyr_1.8.7
## [82] magrittr_2.0.3 bookdown_0.29 R6_2.5.1
## [85] magick_2.7.3 generics_0.1.3 multcompView_0.1-8
## [88] BWStest_0.2.2 DelayedArray_0.24.0 DBI_1.1.3
## [91] pillar_1.8.1 withr_2.5.0 RCurl_1.98-1.9
## [94] tibble_3.1.8 crayon_1.5.2 futile.options_1.0.1
## [97] utf8_1.2.2 BiocFileCache_2.6.0 plotly_4.10.0
## [100] rmarkdown_2.17 kSamples_1.2-9 progress_1.2.2
## [103] grid_4.2.1 data.table_1.14.4 blob_1.2.3
## [106] digest_0.6.30 VennDiagram_1.7.3 tidyr_1.2.1
## [109] BiocBaseUtils_1.0.0 openssl_2.0.4 munsell_0.5.0
## [112] viridisLite_0.4.1 bslib_0.4.1 askpass_1.1
Alonso, Arnald, Antonio Julia, Antoni Beltran, Maria Vinaixa, Marta Diaz, Lourdes Ibanez, Xavier Correig, and Sara Marsal. 2011. “AStream: An R Package for Annotating LC/MS Metabolomic Data.” Bioinformatics 27 (9): 1339–40. https://doi.org/10.1093/bioinformatics/btr138.
Dieterle, Frank, Alfred Ross, Gotz Schlotterbeck, and Hans Senn. 2006. “Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics.” Analytical Chemistry 78 (13): 4281–90. https://doi.org/10.1021/ac051632c.
Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” Nature Protocols 6 (7): 1060–83. https://doi.org/10.1038/nprot.2011.335.
Guitton, Yann, Marie Tremblay-Franco, Gildas Le Corguillé G., Jean-Francois Martin, Melanie Pétéra, Pierrick Roger-Mele, Alexis Delabrière, et al. 2017. “Create, Run, Share, Publish, and Reference Your LC-MS, FIA-MS, GC-MS, and NMR Data Analysis Workflows with the Workflow4Metabolomics 3.0 Galaxy Online Infrastructure for Metabolomics.” The International Journal of Biochemistry & Cell Biology 93: 89–101. https://doi.org/10.1016/j.biocel.2017.07.002.
Imbert, Alyssa, Magali Rompais, Mohammed Selloum, Florence Castelli, Emmanuelle Mouton-Barbosa, Marion Brandolini-Bunlon, Emeline Chu-Van, et al. 2021. “ProMetIS, Deep Phenotyping of Mouse Models by Combined Proteomics and Metabolomics Analysis.” Scientific Data 8 (1). https://doi.org/10.1038/s41597-021-01095-3.
Kloet, Frans M. van der, Ivana Bobeldijk, Elwin R. Verheij, and Renger H. Jellema. 2009. “Analytical Error Reduction Using Single Point Calibration for Accurate and Precise Metabolomic Phenotyping.” Journal of Proteome Research 8 (11): 5132–41. https://doi.org/10.1021/pr900499r.
Meng, Chen, Bernhard Kuster, Aedin C. Culhane, and Amin Moghaddas Gholami. 2014. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics 15 (1): 1–13. https://doi.org/10.1186/1471-2105-15-162.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2022. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.
Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez, Tiffany Chan, et al. 2017. “Software for the Integration of Multiomics Experiments in Bioconductor.” Cancer Research 77 (21): e39–e42. https://doi.org/10.1158/0008-5472.CAN-17-0344.
Reinhold, William C., Margot Sunshine, Hongfang Liu, Sudhir Varma, Kurt W. Kohn, Joel Morris, James Doroshow, and Yves Pommier. 2012. “CellMiner: A Web-Based Suite of Genomic and Pharmacologic Tools to Explore Transcript and Drug Patterns in the NCI-60 Cell Line Set.” Cancer Research 72 (14): 3499–3511. https://doi.org/10.1158/0008-5472.CAN-12-1370.
Rinaudo, Philippe, Samia Boudah, Christophe Junot, and Etienne A Thévenot. 2016. “Biosigner: A New Method for the Discovery of Significant Molecular Signatures from Omics Data.” Frontiers in Molecular Biosciences 3: –. https://doi.org/10.3389/fmolb.2016.00026.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–e47. https://doi.org/10.1093/nar/gkv007.
Roger, Pierrick. 2022. Biodb: Biodb, a Library and a Development Framework for Connecting to Chemical and Biological Databases. https://github.com/pkrog/biodb.
Tenenhaus, M. 1999. “L’approche PLS.” Revue de Statistiques Appliquees 47: 5–40.
Thévenot, Etienne A., Aurélie Roux, Ying Xu, Eric Ezan, and Christophe Junot. 2015. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of Proteome Research 14 (8): 3322–35. https://doi.org/10.1021/acs.jproteome.5b00354.
Wold, S., M. Sjostrom, and L. Eriksson. 2001. “PLS-Regression: A Basic Tool of Chemometrics.” Chemometrics and Intelligent Laboratory Systems 58: 109–30. http://dx.doi.org/10.1016/S0169-7439(01)00155-1.