Untargeted metabolomics goal is to quantify as metabolites as possible from a sample. We can use liquid chromatography coupled to mass spectrometry (LC/MS) for this purpose. It is a great challenge to transform LC/MS data into a profile of annotated metabolites that provides us meaningful biological information. A very important limitation to the annotation of metabolomic experiments is that the number of m/z processed signals, called features, is much bigger than the putative number of metabolites in a sample. The sources that produce multiple features from a single metabolite are multiple and variable. Natural isotopes as carbon isotopes produce isotope features. Ionization of metabolites produce the so called adducts of the metabolite, which are detected as different features depending on the ion adduct involved ([M+Na]+, [M+H]+, etc..). Apart from adduct features, ionization also produces metabolite fragmentation and other reactions as dimerizations, trimerizations, all of them being detected as multiple features. Being the reduction of multiple features to single metabolites a crucial step for the correct annotation of LC/MS experiments, we will show how to use cliqueMS
to do so.
cliqueMS
annotates samples one by one. Annotation can be summarised in these three steps:
Annotation steps are stored in an anClique
S4 object. This object can be created from a XCMSnExp
or a xcmsSet
object with processed m/z data from xcms
package. First m/z raw data is processed:
library(cliqueMS)
mzfile <- system.file("standards.mzXML", package = "cliqueMS")
library(xcms)
mzraw <- readMSData(files = mzfile, mode = "onDisk")
cpw <- CentWaveParam(ppm = 15, peakwidth = c(5,20), snthresh = 10)
mzData <- findChromPeaks(object = mzraw, param = cpw)
Then we can create an anClique
object:
ex.anClique <- createanClique(mzData)
show(ex.anClique)
#> anClique S4 object with 126 features
#> No computed clique groups
#> No isotope annotation
#> No adduct annotation
Here we see an anClique
object before any annotation step. Features have not been grouped, isotopes and adducts are not annotated. Now let’s see the three steps in detail.
Metabolites produce multiple features and very often they do not separate completely in the chromatography, so we observe coelution. This increases the difficulty of the annotation because many features coming from different metabolites might appear very close in the chromatogram. Before trying to annotate isotopes, adducts and fragments we want to make groups of features. Ideally, each group should include all the features produced by a single metabolite.
cliqueMS
uses a similarity network to find groups of features. Each feature is a node, and edges are weighted according to the cosine similarity between features:
\(c_{ij}=\frac{\sum_k f_i(t_k)f_j(t_k)}{\| {f}_i\| \|{f}_j\|}\)
Values from cosine similarity are useful to discriminate pairs of features that come from the same metabolite from pairs of features that come from different metabolites[1]. We compute the cosine similarity using the profile mode of the data, having each feature a m/z value and vector intensities. All features are discretized into a vector of equal bins \(k\). Each vector position relative to retention time \(t_k\) contains the intensity of the feature \(i_k\) at that moment of the chromatography. Features with no coelution at all have a cosine similarity = 0. Edges with weight = 0 are not included in the network, nor nodes without any edge.
Once we have the network, it is time to divide the features into groups. cliqueMS
assumes that the similarity between all features that come from the same metabolite must be \(c_{ij} > 0\). Additionally, similarity values between features of the same metabolite should be generally higher than between features of different metabolites. With this information, cliqueMS
uses a probabilistic model to find the feature groups. This model find cliques, fully connected components so for all nodes \(c_{ij} > 0\). The similarity inside this cliques should be higher than the similarity between features outside the clique. cliqueMS
estimates the log-likelihood for a particular assignment of features into different clique groups. For details in the probabilistic model and the log-likelihood maximisation see[1]. The log-likelihood maximisation procedure can be summarised in the following way:
cliqueMS
starts with each node as a different clique group.Now let’s see how to find this groups with getCliques
.
With the function getCliques
we assign clique groups to our features. This function creates the network of similarity and then computes the clique groups. As input data it uses a xcmsSet
object. getCliques
outputs an anClique
S4 object, which will be used to store all annotation steps.
set.seed(2)
ex.cliqueGroups <- getCliques(mzData, filter = TRUE)
#> Creating anClique object
#> Creating network
#> Features filtered: 0
#> Computing cliques
#> Beggining value of logl is -712.347
#> Aggregate cliques done, with 146 rounds
#> Kernighan-Lin done with 1 rounds
#> Finishing value of logl is -164.515
show(ex.cliqueGroups)
#> anClique S4 object with 126 features
#> Features have been splitted into 17 cliques
#> No isotope annotation
#> No adduct annotation
As we see from the printed messages, the function getCliques
first creates a network, and then it filters features if parameter filter = TRUE
. As m/z signal processing algorithms may produce two artefact features from a single one, it is recommended to set filter = TRUE
to drop repeated features. This filter only drops features with similarity > 0.99, and equal values of m/z, retention time and maximum intensity, defined by the relative error parameters mzerror
, rtdiff
and intdiff
. From the output of the function we see the computed log-likelihood at the beginning, when each feature is in a different clique and the computed log-likelihood at the end of the process. If we now look at the summary
of the resulting anClique
object, we see that the features have been grouped into 69 clique groups. Now we can annotate isotopes.
cliqueMS
annotates isotopes within each clique group. cliqueMS
searches pairs of features than can be carbon isotopes based in these two rules:
Isotopes are annotated with the function getIsotopes
. This function finds pairs of features that fulfil the conditions of an isotope. Then it creates the isotope annotation after removing incoherences like two monoisotopic masses for one isotope, two second isotopes for one first isotope, etc… In all this cases the removed pair is the one with smaller similarity. The use of the function is pretty straightforward:
ex.Isotopes <- getIsotopes(ex.cliqueGroups, ppm = 10)
#> Computing isotopes
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> Updating anClique object
show(ex.Isotopes)
#> anClique S4 object with 126 features
#> Features have been splitted into 17 cliques
#> 37 Features are isotopes
#> No adduct annotation
Parameter ppm
is important because it defines in ppm units the range of the accepted relative error. Once isotopes are annotated we can annotate adducts and fragments.
The last step of cliqueMS
is to annotate adducts. Each feature has a m/z value that is the neutral mass of the metabolite plus the mass of the ion adduct (or fragmentation ion adduct). The neutral mass is an unknown value, but the ion adduct mass is to some degree known as many ion adducts are known. The list of possible adducts should be given as input to cliqueMS
by the user or either use one of the default adduct lists (positive.adinfo
or negative.adinfo
). Here is how the default lists look:
data(positive.adinfo)
head(positive.adinfo)
#> adduct log10freq massdiff nummol charge
#> 1 [M+2H-NH3]2+ -3.512904 -15.012016600 1 2
#> 2 [Cat]3+ -3.512904 -0.001645737 1 3
#> 3 [Cat]2+ -3.512904 -0.001040400 1 2
#> 4 [Cat+H]2+ -3.336813 1.006178842 1 2
#> 5 [M+2H]2+ -1.813934 2.014552000 1 2
#> 6 [M+H+Na]2+ -2.699991 23.996494000 1 2
data(negative.adinfo)
head(negative.adinfo)
#> adduct log10freq massdiff nummol charge
#> 1 [M-2H]2- -1.4029243 -2.014552 1 -2
#> 2 [M+Na-3H]2- -4.2480223 19.967390 1 -2
#> 3 [M-H]- -0.1721106 -1.007276 1 -1
#> 4 [M-H-H2O]- -0.8198875 -19.018390 1 -1
#> 5 [M-H-NH3]- -1.9469923 -18.033815 1 -1
#> 6 [M-H+H2O]- -2.1340790 17.003275 1 -1
The lists should have a column with the name of the adduct, one for the log10 frequency of that adduct, another for the mass of the adduct, one for the number of molecules involved and also one for the charge (see[1] for details in how default lists were made). With the adduct list we can estimate neutral masses.
cliqueMS
searches in each clique for groups of two or more features compatible with a neutral mass and two or more adducts in the adduct list. Neutral masses with only one adduct are not included in the scoring. Once we have all possible neutral masses and their corresponding adducts, the algorithm tries combinations of different adducts and neutral masses to find the most plausible annotation. All combinations are scored and the top five annotations are returned. The scoring is based on the following criteria:
The computed score (which is a logarithmic score) is the sum of the adducts log-frequencies plus the number of empty features (which has a log-frequency smaller than the least frequent adduct) and the number of neutral masses. Within a clique group, it may happen than the annotation of some features is independent from the annotation of some other features, as there is not a single neutral mass with adducts in both groups of features. In those cases, the clique group is splitted in non overlapping groups, called annotation groups. This is common in big cliques. The reported scores refer to annotation groups. The score is useful to see how probable is the first annotation compared to second annotation, third annotation, etc… within an annotation group, but it is not intended to compare annotations between different annotation groups because the score will be smaller when the number of features in the group is bigger.To compare scores from different groups, the option normalizeScore
should be set to TRUE
. The normalized score value is 100 when the score is similar to the theoretical maximum score (all the features annotated with the most frequent adducts and the minimum number of neutral masses) and goes until 0, which is the extreme case that all features of the group are not annotated. To find annotation cliqueMS
uses the function getAnnotation
.
Here is an example of annotating adducts with getAnnotation
ex.Adducts <- getAnnotation(ex.Isotopes, ppm = 10,
adinfo = positive.adinfo, polarity = "positive",
normalizeScore = TRUE)
#> Computing annotation
#> Annotation computed, updating peaklist
show(ex.Adducts)
#> anClique S4 object with 126 features
#> Features have been splitted into 17 cliques
#> 37 Features are isotopes
#> 84 features annotated
As we see from the summary
output, 178 of 275 features have annotation. Function getAnnotation
requires as input an adduct list, the parameter adinfo
. Users can use the default adduct list positive.adinfo
for positive charged adducts and negative.adinfo
for negative charged adducts. polarity
must be set, either to positive
or negative
. Lots of neutral masses are found when the clique groups have many features. In those cases, scoring all annotations could take much time as there are many possible combinations. To prevent this, neutral masses that likely will be in the final top annotations are selected and annotation is computed quickly. The selected masses have the highest frequency adducts and the largest number of adducts. For each clique group, all neutral masses are ordered depending on their score. A number of top scoring masses controlled by topmasstotal
parameter are selected. Additionally and for every feature, the ordered list of scored neutral masses is subsetted to only the neutral masses with adducts in that feature. Then a number of top scoring masses set by topmassf
parameter are selected in each sublist, and added to the set of selected masses. After the mass selection, and in cases of big cliques (size of a “big” clique is defined by parameter sizeanG
), annotation groups are splitted again in new non overlapping groups just taking into account the set of selected neutral masses.
getCliques
stores the annotation in the peaklist
of the anClique
object. Here we can see an overview of some annotated features in our sample:
features.clique6 <- getlistofCliques(ex.Adducts)[[6]]
head(getPeaklistanClique(ex.Adducts)[features.clique6,
c("an1","mass1","an2", "mass2", "an3", "mass3", "an4", "mass4", "an5",
"mass5")], n = 10)
#> an1 mass1 an2 mass2 an3 mass3 an4
#> CP037 [M+K]+ 242.0805 NA [M+H+K]2+ 522.1188
#> CP038 [2M+H]+ 242.0805 [M+H]+ 484.1605 [M+Na]+ 462.1786 [M+H-H2O]+
#> CP039 NA [M+NH4]+ 484.1605 NA
#> CP040 [2M+K]+ 242.0805 [M+K]+ 484.1605 [M+H]+ 522.1188 [M+K-H2O]+
#> CP041 [Cat-H]+ 105.0583 [Cat]2+ 208.1014 [Cat-H]+ 105.0583 [Cat-H]+
#> CP042 [M+H-OH]+ 216.0768 [M+H]+ 199.0745 [M+H]+ 199.0745 [M+H]+
#> CP043 NA NA NA
#> CP044 [Cat]+ 118.0649 [Cat]+ 118.0649 [Cat]+ 118.0649 [Cat]+
#> CP045 [Cat-H2O]+ 216.0768 [Cat-H]+ 199.0745 [Cat-H]+ 199.0745 [Cat-H]+
#> CP046 [M+Na]+ 242.0805 [M-H+2Na]+ 220.0986 [M+Na]+ 242.0805 [M-H+2Na]+
#> mass4 an5 mass5
#> CP037 NA [M+H+K]2+ 522.1188
#> CP038 502.1711 [M+H-NH3]+ 501.1871
#> CP039 NA [M+H]+ 501.1871
#> CP040 502.1711 [M+H]+ 522.1188
#> CP041 105.0583 [Cat-H]+ 105.0583
#> CP042 199.0745 [M+H]+ 199.0745
#> CP043 NA NA
#> CP044 118.0649 [Cat]+ 118.0649
#> CP045 199.0745 [Cat-H]+ 199.0745
#> CP046 220.0986 [M-H+2Na]+ 220.0986
Now we have obtained the neutral mass and the adduct annotation for our features. We could use the neutral mass together with the retention time and MS/MS data to annotate more confidently some of these metabolites. We also know how many features in the dataset are isotopes. Finally, we have achieved a reduction in the complexity of our the dataset, from many features to a signficant smaller number annotated neutral masses that have different adducts and isotopes.
[1]: “CliqueMS: a computational tool for annotating in-source metabolite ions from LC-MS untargeted metabolomics data based on a coelution similarity network”. Oriol Senan, Antoni Aguilar-Mogas, Miriam Navarro, Jordi Capellades, Luke Noon, Deborah Burks, Oscar Yanes, Roger Guimerà and Marta Sales-Pardo. Bioinformatics. Accepted March 2019.