Package: MsBackendMetaboLights
Authors: Johannes Rainer [aut, cre] (ORCID: https://orcid.org/0000-0002-6977-7147),
Philippine Louail [aut] (ORCID: https://orcid.org/0009-0007-5429-6846)
Last modified: 2024-10-24 01:16:11.227111
Compiled: Tue Oct 29 18:27:17 2024
The Spectra package provides a central infrastructure for handling Mass Spectrometry (MS) data in Bioconductor. It supports the interchangeable use of different backends to import and represent MS data from a variety of sources and data formats. The MsBackendMetaboLights package retrieves MS data files directly from the MetaboLights repository. MetaboLights is one of the main public repositories for the deposition of metabolomics experiments, including (raw) MS and/or NMR data files and the related experimental and analytical results. MsBackendMetaboLights downloads and locally caches the MS data files of a MetaboLights data set and enables further analysis of these data directly in R.
The package can be installed from within R with the commands below:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RforMassSpectrometry/MsBackendMetaboLights")
MetaboLights is one of the main public repositories for the deposition of metabolomics experiments, including (raw) mass spectrometry (MS) and NMR data files and experimental/analysis results. The experimental metadata and results are stored as plain text files in ISA-tab format. Each MetaboLights experiment must provide a file describing the samples analyzed and at least one assay file that links the experimental samples to the (raw and processed) data files with the quantification of metabolites/features in these samples.
In this vignette we explore and load MS data files from a small MetaboLights experiment. MetaboLights provides information on a data set/experiment as a set of plain text files in ISA-tab format. These can be accessed and read from the data set’s ftp folder. The set of files generally consists of a file with information on the experiment/investigation (file name starting with i_), a file describing the samples of the data set (file name starting with s_), a file describing the assay (measurements/analysis) of the experiment (file name starting with a_) and a file with quantified metabolite abundances (file name starting with m_). Note that a data set can have more than one assay file.
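As a minimal sketch of this naming convention, the base R snippet below sorts a vector of file names into ISA-tab roles by their two-character prefix. The file names and the helper vector prefix_role are made up for illustration and are not part of MsBackendMetaboLights.

```r
## Hypothetical file names following the ISA-tab prefixes described above
files <- c("i_Investigation.txt", "s_study.txt", "a_study_assay.txt",
           "m_study_maf.tsv", "FILES/MN063A.cdf")

## Map the two-character prefix of each file name to its role
## (prefix_role is an illustrative helper, not a package object)
prefix_role <- c(i_ = "investigation", s_ = "sample",
                 a_ = "assay", m_ = "metabolite abundances")
role <- unname(prefix_role[substr(basename(files), 1, 2)])

## Anything without a known prefix (e.g. MS data files) is labeled "other"
role[is.na(role)] <- "other"

## Group the file names by role
split(files, role)
```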
Below we list all files from the MetaboLights data set with the ID MTBLS39.
library(MsBackendMetaboLights)
#' List files of a MetaboLights data set
all_files <- mtbls_list_files("MTBLS39")
All these files are directly accessible in the ftp folder associated with the
MetaboLights data set. Below we use the mtbls_ftp_path() function to return
the ftp path for our test data set.
mtbls_ftp_path("MTBLS39")
## [1] "ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS39/"
We could also inspect the content of this folder using a browser that supports
the ftp file transfer protocol and download individual files manually. However,
the files can also be accessed directly from within R. Below we read the assay
data file using the base R read.table() function.
#' Get the assay files of the data set
grep("^a_", all_files, value = TRUE)
## [1] "a_MTBLS39_the_plasticity_of_the_grapevine_berry_transcriptome_metabolite_profiling_mass_spectrometry.txt"
#' Read the assay file
a <- read.table(paste0(mtbls_ftp_path("MTBLS39"),
                       grep("^a_", all_files, value = TRUE)),
                sep = "\t", header = TRUE, check.names = FALSE)
Each row in this assay table refers to one measurement (data file) of the data set, with columns providing information on that measurement. The number and content of columns can vary between data sets and depend on the information the original researcher (manually) provided. Below we list the columns available in the assay file of our test data set.
colnames(a)
## [1] "Sample Name"
## [2] "Protocol REF"
## [3] "Parameter Value[Post Extraction]"
## [4] "Parameter Value[Derivatization]"
## [5] "Extract Name"
## [6] "Protocol REF"
## [7] "Parameter Value[Chromatography Instrument]"
## [8] "Term Source REF"
## [9] "Term Accession Number"
## [10] "Parameter Value[Autosampler model]"
## [11] "Term Source REF"
## [12] "Term Accession Number"
## [13] "Parameter Value[Column model]"
## [14] "Parameter Value[Column type]"
## [15] "Parameter Value[Guard column]"
## [16] "Term Source REF"
## [17] "Term Accession Number"
## [18] "Labeled Extract Name"
## [19] "Label"
## [20] "Term Source REF"
## [21] "Term Accession Number"
## [22] "Protocol REF"
## [23] "Parameter Value[Scan polarity]"
## [24] "Parameter Value[Scan m/z range]"
## [25] "Parameter Value[Instrument]"
## [26] "Term Source REF"
## [27] "Term Accession Number"
## [28] "Parameter Value[Ion source]"
## [29] "Term Source REF"
## [30] "Term Accession Number"
## [31] "Parameter Value[Mass analyzer]"
## [32] "Term Source REF"
## [33] "Term Accession Number"
## [34] "MS Assay Name"
## [35] "Raw Spectral Data File"
## [36] "Protocol REF"
## [37] "Normalization Name"
## [38] "Derived Spectral Data File"
## [39] "Protocol REF"
## [40] "Data Transformation Name"
## [41] "Metabolite Assignment File"
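The listing above contains repeated column names such as "Protocol REF" and "Term Source REF", which is why check.names = FALSE matters when reading the file. The sketch below uses a small made-up tab-separated text instead of a real assay file to illustrate how read.table() treats such names with and without that setting.

```r
## Made-up ISA-tab-like content with a repeated column name
isa <- paste0("Sample Name\tProtocol REF\tExtract Name\tProtocol REF\n",
              "S1\tExtraction\tE1\tChromatography\n")

## check.names = FALSE keeps the original (possibly duplicated) names
a_raw <- read.table(text = isa, sep = "\t", header = TRUE,
                    check.names = FALSE)
colnames(a_raw)

## check.names = TRUE converts them to unique, syntactic names
a_chk <- read.table(text = isa, sep = "\t", header = TRUE,
                    check.names = TRUE)
colnames(a_chk)

## Positions of all columns sharing one name
which(colnames(a_raw) == "Protocol REF")
```

Because the names are not unique, selecting such columns by position is less ambiguous than selecting them by name.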
MS data files are generally provided in a column named "Derived Spectral Data
File", but sometimes they are also listed in a column named "Raw Spectral Data
File". Note that providing MS data files is not mandatory; for some data sets
no MS data files might therefore be available. Below we list the content of
these data columns.
a[, c("Raw Spectral Data File", "Derived Spectral Data File")]
## Raw Spectral Data File Derived Spectral Data File
## 1 FILES/MN063A.cdf NA
## 2 FILES/MN063B.cdf NA
## 3 FILES/MN063C.cdf NA
## 4 FILES/CS063A.cdf NA
## 5 FILES/CS063B.cdf NA
## 6 FILES/CS063C.cdf NA
## 7 FILES/AM063A.cdf NA
## 8 FILES/AM063B.cdf NA
## 9 FILES/AM063C.cdf NA
## 10 FILES/MN073A.cdf NA
## 11 FILES/MN073B.cdf NA
## 12 FILES/MN073C.cdf NA
## 13 FILES/CS073A.cdf NA
## 14 FILES/CS073B.cdf NA
## 15 FILES/CS073C.cdf NA
## 16 FILES/AM073A.cdf NA
## 17 FILES/AM073B.cdf NA
## 18 FILES/AM073C.cdf NA
## 19 FILES/MN083A.cdf NA
## 20 FILES/MN083B.cdf NA
## 21 FILES/MN083C.cdf NA
## 22 FILES/CS083A.cdf NA
## 23 FILES/CS083B.cdf NA
## 24 FILES/CS083C.cdf NA
## 25 FILES/AM083A.cdf NA
## 26 FILES/AM083B.cdf NA
## 27 FILES/AM083C.cdf NA
For this particular data set the MS data files are provided in the "Raw
Spectral Data File" column. These files are in CDF format and can hence be
loaded into R as a Spectra object using the MsBackendMetaboLights backend
(MsBackendMetaboLights directly extends Spectra’s MsBackendMzR backend and
therefore supports import of MS data files in mzML, CDF or mzXML format). By
default, all MS data files of all assays would be retrieved, but in our example
below we restrict the import to a few data files to reduce the amount of data
that needs to be downloaded. To this end we use the filePattern parameter to
define a pattern matching the file names of only some data files.
Alternatively, for data sets with more than one assay, the assayName parameter
would allow selecting MS data files from one particular assay only. In our case
we load all MS data files with names ending in 63A.cdf.
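Before downloading anything, the effect of a pattern can be previewed on the file names listed in the assay table. The sketch below assumes that the pattern is interpreted as a regular expression, as grepl() does. How the package applies filePattern internally is not shown in this vignette, so treat this as an approximation. The vector data_files is a hand-typed subset of the file names listed above.

```r
## A few file names copied from the "Raw Spectral Data File" column
data_files <- c("FILES/MN063A.cdf", "FILES/MN063B.cdf", "FILES/CS063A.cdf",
                "FILES/AM063A.cdf", "FILES/MN073A.cdf")

## Files selected by the pattern used in this vignette
sel <- data_files[grepl("63A.cdf", data_files)]
sel

## In a regular expression "." matches any character; escaping it and
## anchoring the pattern at the end of the name is stricter
sel_strict <- data_files[grepl("63A\\.cdf$", data_files)]
sel_strict
```

For these file names both patterns select the same three files. The actual import with this pattern follows.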
library(Spectra)
#' Load MS data files of one data set
s <- Spectra("MTBLS39", filePattern = "63A.cdf",
             source = MsBackendMetaboLights())
## Used data files from the assay's column "Raw Spectral Data File" since none were available in column "Derived Spectral Data File".
s
## MSn data (Spectra) with 1664 spectra in a MsBackendMetaboLights backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.296384 1
## 2 1 6.206912 2
## 3 1 12.093056 3
## 4 1 17.942912 4
## 5 1 23.835072 5
## ... ... ... ...
## 1660 1 2678.27 549
## 1661 1 2683.01 550
## 1662 1 2687.81 551
## 1663 1 2692.62 552
## 1664 1 2697.40 553
## ... 36 more variables/columns.
##
## file(s):
## MN063A.cdf
## CS063A.cdf
## AM063A.cdf
This call downloaded the files to the local cache and loaded them as a Spectra
object. The downloading and caching of the data is handled by Bioconductor’s
BiocFileCache package. The local cache can thus be managed directly using
functionality from that package. Any subsequent loading of the same data files
will use the locally cached versions, thus avoiding repeated downloads of the
same data.
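As a rough illustration of such cache management, the sketch below uses BiocFileCache functions on a throw-away cache in a temporary directory, so that no real cached data is touched. The placeholder file is made up and only stands in for a cached MS data file. To inspect the real cache one would instead call BiocFileCache() without a path, assuming the package uses the default cache location.

```r
library(BiocFileCache)

## Throw-away cache in a temporary directory (not the default cache)
bfc <- BiocFileCache(tempfile("bfc_"), ask = FALSE)

## Register a small placeholder file, a hypothetical stand-in for a
## cached MS data file
f <- file.path(tempdir(), "AM063A.cdf")
writeLines("placeholder", f)
bfcadd(bfc, rname = "AM063A.cdf", fpath = f, action = "copy")

## Number of cached resources and a query by resource name
bfccount(bfc)
hits <- bfcquery(bfc, "AM063A", field = "rname")
hits$rid

## Remove the resource from the cache again
bfcremove(bfc, hits$rid)
bfccount(bfc)
```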
The message shown by the call above indicates that the MS data files were not
provided in the expected column ("Derived Spectral Data File") but in the
column for raw data files.
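The fallback described by this message can be pictured with the base R sketch below. It is not the package's implementation. The helper pick_data_column() and the small assay table are hypothetical and only mimic the behavior reported in the message.

```r
## Toy assay table mimicking MTBLS39: only the raw data column is filled
assay <- data.frame(
    "Raw Spectral Data File" = c("FILES/MN063A.cdf", "FILES/CS063A.cdf"),
    "Derived Spectral Data File" = c(NA_character_, NA_character_),
    check.names = FALSE)

## Hypothetical helper: prefer the derived data column and fall back to
## the raw data column if the preferred one has no file names
pick_data_column <- function(x,
                             preferred = "Derived Spectral Data File",
                             fallback = "Raw Spectral Data File") {
    has_files <- function(col)
        col %in% colnames(x) && any(!is.na(x[[col]]) & nzchar(x[[col]]))
    if (has_files(preferred)) preferred
    else if (has_files(fallback)) fallback
    else NA_character_
}

pick_data_column(assay)
```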
The Spectra object with the MS data files of the MetaboLights data set now
enables any subsequent analysis of the data in R. In addition to the spectra
variables and mass peak data provided by the MS data files, information related
to the MetaboLights data set is available as specific spectra variables. We
list all available spectra variables of the data set below.
spectraVariables(s)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "peaksCount"
## [19] "totIonCurrent" "basePeakMZ"
## [21] "basePeakIntensity" "ionisationEnergy"
## [23] "lowMZ" "highMZ"
## [25] "mergedScan" "mergedResultScanNum"
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
## [29] "injectionTime" "filterString"
## [31] "spectrumId" "ionMobilityDriftTime"
## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
## [35] "mtbls_id" "mtbls_assay_name"
## [37] "derived_spectral_data_file"
The MetaboLights-specific variables are "mtbls_id", "mtbls_assay_name" and
"derived_spectral_data_file". They provide the MetaboLights ID of the data set,
the assay/method with which the data files were generated and the original file
path/name of the data files on the MetaboLights ftp server.
spectraData(s, c("mtbls_id", "mtbls_assay_name",
                 "derived_spectral_data_file"))
## DataFrame with 1664 rows and 3 columns
## mtbls_id mtbls_assay_name derived_spectral_data_file
## <character> <character> <character>
## 1 MTBLS39 a_MTBLS39_the_plasti.. FILES/MN063A.cdf
## 2 MTBLS39 a_MTBLS39_the_plasti.. FILES/MN063A.cdf
## 3 MTBLS39 a_MTBLS39_the_plasti.. FILES/MN063A.cdf
## 4 MTBLS39 a_MTBLS39_the_plasti.. FILES/MN063A.cdf
## 5 MTBLS39 a_MTBLS39_the_plasti.. FILES/MN063A.cdf
## ... ... ... ...
## 1660 MTBLS39 a_MTBLS39_the_plasti.. FILES/AM063A.cdf
## 1661 MTBLS39 a_MTBLS39_the_plasti.. FILES/AM063A.cdf
## 1662 MTBLS39 a_MTBLS39_the_plasti.. FILES/AM063A.cdf
## 1663 MTBLS39 a_MTBLS39_the_plasti.. FILES/AM063A.cdf
## 1664 MTBLS39 a_MTBLS39_the_plasti.. FILES/AM063A.cdf
These variables can be used to link the individual spectra back to the original sample (e.g. through the assay and sample tables of the MetaboLights data set).
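One possible way to make this link is to match the file path stored with each spectrum against the data file column of the assay table. The sketch below uses small made-up vectors in place of the real objects. The sample names are hypothetical. With the objects from this vignette, s$derived_spectral_data_file and the assay table a would take the place of spectra_file and assay.

```r
## Toy assay table; the sample names are invented for illustration
assay <- data.frame(
    "Sample Name" = c("sample_1", "sample_2", "sample_3"),
    "Raw Spectral Data File" = c("FILES/MN063A.cdf", "FILES/CS063A.cdf",
                                 "FILES/AM063A.cdf"),
    check.names = FALSE)

## Stand-in for the per-spectrum "derived_spectral_data_file" variable
spectra_file <- c("FILES/MN063A.cdf", "FILES/MN063A.cdf",
                  "FILES/AM063A.cdf")

## Look up, for each spectrum, the row of the assay table with its file
idx <- match(spectra_file, assay[["Raw Spectral Data File"]])
sample_name <- assay[["Sample Name"]][idx]
sample_name
```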
The mtbls_sync() function can be used to synchronize the local content of a
MsBackendMetaboLights backend. This function checks whether all data files of
the backend are available locally and, if necessary, downloads and caches
missing files.
mtbls_sync(s@backend)
## Used data files from the assay's column "Raw Spectral Data File" since none were available in column "Derived Spectral Data File".
## MsBackendMetaboLights with 1664 spectra
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.296384 1
## 2 1 6.206912 2
## 3 1 12.093056 3
## 4 1 17.942912 4
## 5 1 23.835072 5
## ... ... ... ...
## 1660 1 2678.27 549
## 1661 1 2683.01 550
## 1662 1 2687.81 551
## 1663 1 2692.62 552
## 1664 1 2697.40 553
## ... 36 more variables/columns.
##
## file(s):
## MN063A.cdf
## CS063A.cdf
## AM063A.cdf
It is also possible to manually download and cache data files from
MetaboLights using the mtbls_sync_data_files() function. This function
evaluates whether the respective data files are already cached and, if so, does
not download them again. Below we use this function to retrieve the local
storage information for one of the data files of the MetaboLights data set
MTBLS39:
res <- mtbls_sync_data_files("MTBLS39", fileName = "AM063A.cdf")
## Used data files from the assay's column "Raw Spectral Data File" since none were available in column "Derived Spectral Data File".
res
## rid mtbls_id
## 1 BFC649 MTBLS39
## mtbls_assay_name
## 1 a_MTBLS39_the_plasticity_of_the_grapevine_berry_transcriptome_metabolite_profiling_mass_spectrometry.txt
## derived_spectral_data_file rpath
## 1 FILES/AM063A.cdf /home/biocbuild/.cache/R/BiocFileCache/AM063A.cdf
The mtbls_cached_data_files() function can be used to inspect and list locally
cached MetaboLights data files. This function does not require an active
internet connection since only local content is queried. With the default
settings, a data.frame with all available data files is returned.
mtbls_cached_data_files()
## rid mtbls_id
## 3 BFC649 MTBLS39
## mtbls_assay_name
## 3 a_MTBLS39_the_plasticity_of_the_grapevine_berry_transcriptome_metabolite_profiling_mass_spectrometry.txt
## derived_spectral_data_file rpath
## 3 FILES/AM063A.cdf /home/biocbuild/.cache/R/BiocFileCache/AM063A.cdf
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MsBackendMetaboLights_1.1.0 Spectra_1.17.0
## [3] BiocParallel_1.41.0 S4Vectors_0.45.0
## [5] BiocGenerics_0.53.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 MsCoreUtils_1.19.0 utf8_1.2.4
## [4] generics_0.1.3 RSQLite_2.3.7 digest_0.6.37
## [7] magrittr_2.0.3 evaluate_1.0.1 bookdown_0.41
## [10] fastmap_1.2.0 blob_1.2.4 jsonlite_1.8.9
## [13] ProtGenerics_1.39.0 mzR_2.41.0 DBI_1.2.3
## [16] BiocManager_1.30.25 httr_1.4.7 purrr_1.0.2
## [19] fansi_1.0.6 codetools_0.2-20 jquerylib_0.1.4
## [22] cli_3.6.3 rlang_1.1.4 dbplyr_2.5.0
## [25] Biobase_2.67.0 bit64_4.5.2 withr_3.0.2
## [28] cachem_1.1.0 yaml_2.3.10 tools_4.5.0
## [31] parallel_4.5.0 memoise_2.0.1 dplyr_1.1.4
## [34] ncdf4_1.23 filelock_1.0.3 curl_5.2.3
## [37] vctrs_0.6.5 R6_2.5.1 BiocFileCache_2.15.0
## [40] lifecycle_1.0.4 fs_1.6.4 IRanges_2.41.0
## [43] bit_4.5.0 clue_0.3-65 MASS_7.3-61
## [46] cluster_2.1.6 pkgconfig_2.0.3 bslib_0.8.0
## [49] pillar_1.9.0 Rcpp_1.0.13 glue_1.8.0
## [52] xfun_0.48 tibble_3.2.1 tidyselect_1.2.1
## [55] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
## [58] compiler_4.5.0 MetaboCoreUtils_1.15.0