library(Risa)
library(xcms)
library(CAMERA)
library(pcaMethods)
Indole-3-acetaldoxime (IAOx) represents an early intermediate of the biosynthesis of a variety of indolic secondary metabolites including the phytoanticipin indol-3-ylmethyl glucosinolate and the phytoalexin camalexin (3-thiazol-2’-yl-indole). Arabidopsis thaliana cyp79B2 cyp79B3 double knockout plants are completely impaired in the conversion of tryptophan to indole-3-acetaldoxime and do not accumulate IAOx-derived metabolites any longer. Consequently, comparative analysis of wild-type and cyp79B2 cyp79B3 plant lines has the potential to explore the complete range of IAOx-derived indolic secondary metabolites.
Since 2006, the Bioconductor package xcms (Smith et al, 2006) provides a rich set of algorithms for mass spectrometry data processing. Typically, xcms will create an xcmsSet object from several raw data files in an assay, which are obtained from the samples in the study.
Allowed raw data formats are netCDF, mzData, mzXML and mzML.
In this vignette, we demonstrate the processing of the MTBLS2 dataset, which was described in Neumann 2012.
A few things might be worth to define at the beginning of an analysis
## How many CPU cores has your machine (or cluster) ?
nSlaves=1
# prefilter <- c(3,200) ## standard
prefilter=c(6,750) ## quick-run for debugging
This can be done with the vendor tools, or the open source proteowizard converter. The preferred format should be mzML or mzData/mzXML. An overview of formats (and problems) is available at the xcms online help pages.
An ISAtab archive will contain the metadata description in several tab-separated files. (One of) the assay files contains the column Raw Spectral Data File
with the paths to the mass spectral raw data files in one of the above formats.
ISAmtbls2 <- readISAtab(find.package("mtbls2"))
a.filename <- ISAmtbls2["assay.filenames"][[1]]
With the combination of Risa and xcms, we can convert the MS raw data in an ISAtab archive into an xcmsSet:
mtbls2Set <- processAssayXcmsSet(ISAmtbls2, a.filename,
method="centWave", prefilter=prefilter,
snthr=25, ppm=25,
peakwidth=c(5,12),
nSlaves=nSlaves)
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 11335 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3707 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 11193 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3690 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 11156 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3722 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 11146 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3771 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12661 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4045 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12041 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3862 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12489 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3969 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 11940 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 3897 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12627 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4124 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 13344 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4226 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12897 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4185 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 12689 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4190 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 13007 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4429 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 13666 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4407 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 13033 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4337 Peaks.
##
## Detecting mass traces at 25 ppm ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 13646 m/z ROI's.
##
## Detecting chromatographic peaks ...
## % finished: 0 10 20 30 40 50 60 70 80 90 100
## 4377 Peaks.
The result is the same type of xcmsSet object:
show(mtbls2Set)
## An "xcmsSet" object with 16 samples
##
## Time range: 18.4-1147.6 seconds (0.3-19.1 minutes)
## Mass range: 99.5288-1003.5005 m/z
## Peaks: 64938 (about 4059 per sample)
## Peak Groups: 0
## Sample classes: Col-0.Exp1, cyp79.Exp1, Col-0.Exp2, cyp79.Exp2
##
## Peak picking was performed on MS1.
## Profile settings: method = bin
## step = 0.1
##
## Memory usage: 6.33 MB
Several options exist to quantify the individual intensities. For each feature, additional attributes are available, such as the minimum/maximum and average retention time and m/z values.
In the following steps, we perform a grouping: because the UPLC system used here has very stable retention times, we just use the retention time correction step as quality control of the raw data. After that, ‘fillPeaks()’ will integrate the raw data for those features, which were not detected in some of the samples.
mtbls2Set <- group(mtbls2Set, minfrac=1, bw=3)
## 162 224 287 349 412 474 537 599 662 724 787 849 912 974
retcor(mtbls2Set, plottype="mdevden")
## Retention Time Correction Groups: 1227
A first QC step is the visual inspection of intensities across the samples. Alternatively to a boxplot, one could also create histograms/density plots.
boxplot(groupval(mtbls2Set, value="into")+1,
col=as.numeric(sampclass(mtbls2Set))+1,
log="y", las=2)
After grouping, peaks might be missing/not found in some samples. fillPekas()
will impute them, using the consensus mz and RT from the other samples.
mtbls2Set <- fillPeaks(mtbls2Set, nSlaves=nSlaves)
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-2_1-B,2_01_9821.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-3_1-B,1_01_9823.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex1-cyp79-48h-Ag-4_1-B,2_01_9825.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-1_1-A,2_01_9827.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-2_1-A,3_01_9829.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-3_1-A,4_01_9831.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-Col0-48h-Ag-4_1-A,2_01_9833.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-1_1-B,3_01_9828.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-2_1-B,4_01_9830.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-3_1-B,3_01_9832.mzData
## /tmp/RtmpOCW2HC/Rinst361baa6a7aa/mtbls2/mzData/MSpos-Ex2-cyp79-48h-Ag-4_1-B,4_01_9834.mzData
The final xcmsSet represents a rectangular matrix of mass spectral features, which were detected (or imputed) across the samples. The dimensionality is M * N, where M denotes the number of samples in the assay, and N the number of features grouped across the samples.
QC of mass accuracy and retention time consistency
plotQC(mtbls2Set)
In addition to the boxplot for QC, we can also check a hierarchical clustering and the PCA of the samples.
sdThresh <- 4.0 ## Filter low-standard deviation rows for plot
data <- log(groupval(mtbls2Set, value="into")+1)
pca.result <- pca(data, nPcs=3)
plotPcs(pca.result, type="loadings",
col=as.numeric(sampclass(mtbls2Set))+1)
an <- xsAnnotate(mtbls2Set,
sample=seq(1,length(sampnames(mtbls2Set))),
nSlaves=nSlaves)
an <- groupFWHM(an)
an <- findIsotopes(an) # optional but recommended.
an <- groupCorr(an,
graphMethod="lpc",
calcIso = TRUE,
calcCiS = TRUE,
calcCaS = TRUE,
cor_eic_th=0.5)
## Setup ruleSet
rs <- new("ruleSet")
rs@ionlistfile <- file.path(find.package("mtbls2"), "lists","ions.csv")
rs@neutraladditionfile <- file.path(find.package("mtbls2"), "lists","neutraladdition.csv")
rs@neutrallossfile <- file.path(find.package("mtbls2"), "lists","neutralloss.csv")
rs <- readLists(rs)
rs <- setDefaultParams(rs)
rs <- generateRules(rs)
an <- findAdducts(an,
rules=rs@rules,
polarity="positive")
dr <- diffreport(mtbls2Set, sortpval=FALSE, filebase="mtbls2diffreport", eicmax=20 )
## Loading required package: multtest
## MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818 MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820 MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822 MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824 MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819 MSpos-Ex1-cyp79-48h-Ag-2_1-B,2_01_9821 MSpos-Ex1-cyp79-48h-Ag-3_1-B,1_01_9823 MSpos-Ex1-cyp79-48h-Ag-4_1-B,2_01_9825
## Warning in dir.create(eicdir): 'mtbls2diffreport_eic' already exists
## Warning in dir.create(boxdir): 'mtbls2diffreport_box' already exists
cspl <- getPeaklist(an)
annotatedDiffreport <- cbind(dr, cspl)
interestingPspec <- tapply(seq(1, nrow(annotatedDiffreport)),
INDEX=annotatedDiffreport[,"pcgroup"],
FUN=function(x, a) {m <- median(annotatedDiffreport[x, "pvalue"]);
p <- max(annotatedDiffreport[x, "pcgroup"]);
as.numeric(c(pvalue=m,pcgroup=p))},
annotatedDiffreport)
interestingPspec <- do.call(rbind, interestingPspec)
colnames(interestingPspec) <- c("pvalue", "pcgroup")
o <- order(interestingPspec[,"pvalue"])
pdf("interestingPspec.pdf")
dummy <- lapply(interestingPspec[o[1:40], "pcgroup"],
function(x) {suppressWarnings(plotPsSpectrum(an, pspec=x, maxlabel=5))})
dev.off()
## png
## 2
These attributes and the intensity matrix could already be exported to conform to the specification for the ``metabolite assignment file’’ in the mzTab format used in MetaboLights. Currently, this functionality is not included in xcms. A prototype snippet is the following:
pl <- annotatedDiffreport
charge <- sapply(an@isotopes, function(x) {
ifelse( length(x) > 0, x$charge, NA)
})
abundance <- groupval(an@xcmsSet, value="into")
##
## load ISA assay files
##
a.samples <- ISAmtbls2["samples.per.assay.filename"][[ a.filename ]]
##
## These columns are defined by mzTab
##
maf.std.colnames <- c("identifier", "chemical_formula", "description",
"mass_to_charge", "fragmentation", "charge", "retention_time",
"taxid", "species", "database", "database_version", "reliability",
"uri", "search_engine", "search_engine_score", "modifications",
"smallmolecule_abundance_sub", "smallmolecule_abundance_stdev_sub",
"smallmolecule_abundance_std_error_sub")
##
## Plus the columns for the sample intensities
##
all.colnames <- c(maf.std.colnames, a.samples)
##
## Now assemble new maf
##
l <- nrow(pl)
maf <- data.frame(identifier = character(l),
chemical_formula = character(l),
description = character(l),
mass_to_charge = pl$mz,
fragmentation = character(l),
charge = charge,
retention_time = pl$rt,
taxid = character(l),
species = character(l),
database = character(l),
database_version = character(l),
reliability = character(l),
uri = character(l),
search_engine = character(l),
search_engine_score = character(l),
modifications = character(l),
smallmolecule_abundance_sub = character(l),
smallmolecule_abundance_stdev_sub = character(l),
smallmolecule_abundance_std_error_sub = character(l),
abundance, stringsAsFactors=FALSE)
##
## Make sure maf table is quoted properly,
## and add to the ISAmtbls2 assay file.
##
maf_character <- apply(maf, 2, as.character)
write.table(maf_character,
file=paste(tempdir(), "/a_mtbl2_metabolite profiling_mass spectrometry_maf.csv", sep=""),
row.names=FALSE, col.names=all.colnames,
quote=TRUE, sep="\t", na="\"\"")
ISAmtbls2 <- updateAssayMetadata(ISAmtbls2, a.filename,
"Metabolite Assignment File",
paste(tempdir(), "/a_mtbl2_metabolite profiling_mass spectrometry_maf.csv", sep=""))
write.assay.file(ISAmtbls2, a.filename)
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] multtest_2.26.0 pcaMethods_1.60.0 CAMERA_1.26.0
## [4] igraph_1.0.1 xcms_1.46.0 ProtGenerics_1.2.0
## [7] mzR_2.4.0 Risa_1.11.1 affy_1.48.0
## [10] biocViews_1.38.0 Rcpp_0.12.1 Biobase_2.30.0
## [13] BiocGenerics_0.16.0
##
## loaded via a namespace (and not attached):
## [1] plyr_1.8.3 BiocInstaller_1.20.0 formatR_1.2.1
## [4] RColorBrewer_1.1-2 bitops_1.0-6 tools_3.2.2
## [7] zlibbioc_1.16.0 rpart_4.1-10 digest_0.6.8
## [10] gtable_0.1.2 evaluate_0.8 preprocessCore_1.32.0
## [13] lattice_0.20-33 graph_1.48.0 yaml_2.1.13
## [16] proto_0.3-10 gridExtra_2.0.0 cluster_2.0.3
## [19] stringr_1.0.0 knitr_1.11 nnet_7.3-11
## [22] stats4_3.2.2 grid_3.2.2 XML_3.98-1.3
## [25] RBGL_1.46.0 survival_2.38-3 foreign_0.8-66
## [28] rmarkdown_0.8.1 latticeExtra_0.6-26 Formula_1.2-1
## [31] reshape2_1.4.1 ggplot2_1.0.1 magrittr_1.5
## [34] MASS_7.3-44 scales_0.3.0 Hmisc_3.17-0
## [37] codetools_0.2-14 htmltools_0.2.6 splines_3.2.2
## [40] RUnit_0.4.30 colorspace_1.2-6 stringi_0.5-5
## [43] acepack_1.3-3.3 munsell_0.4.2 RCurl_1.95-4.7
## [46] affyio_1.40.0