Introduction
DNA methylation can be used to identify functional changes at
transcriptional enhancers and other cis-regulatory modules (CRMs) in
tumors and other primary disease tissues. Our R/Bioconductor package
ELMER (Enhancer Linking by Methylation/Expression Relationships) provides a
systematic approach that reconstructs gene regulatory networks (GRNs) by
combining methylation and gene expression data derived from the same set
of samples. ELMER uses
methylation changes at CRMs as the central hub of these networks, using
correlation analysis to associate them with both upstream master
regulator (MR) transcription factors and downstream target genes.
This package can be easily applied to publicly available TCGA cancer
data sets and to custom DNA methylation and gene expression data sets.
ELMER analyses have 5 main steps:
- Identify distal probes on HM450K or EPIC arrays.
- Identify distal probes with significantly different DNA methylation
levels between two groups.
- Identify putative target genes for differentially methylated distal
probes.
- Identify enriched motifs for the distal probes that are
significantly differentially methylated and linked to a putative target
gene.
- Identify regulatory TFs whose expression is associated with DNA
methylation at enriched motifs.
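To make the idea behind the third step concrete, here is a minimal base R sketch on simulated data. It is not ELMER's implementation: the helper `test_probe_gene`, the 20% fraction, and the choice of a one-sided Wilcoxon rank-sum test are assumptions made for illustration, and the package documentation should be consulted for the statistic ELMER actually applies.

```r
# Toy illustration only: simulated data, not ELMER's implementation.
set.seed(42)
n <- 40

# Methylation (beta values) of one distal probe across n samples
meth <- runif(n, min = 0, max = 1)

# Expression of two nearby genes: geneA drops as methylation rises,
# geneB is unrelated to the probe
expr <- rbind(
  geneA = 10 - 6 * meth + rnorm(n, sd = 0.5),
  geneB = rnorm(n, mean = 8, sd = 1)
)

# Hypothetical helper: compare expression between the least and the most
# methylated samples (a fraction `frac` of the samples on each side)
test_probe_gene <- function(meth, gene_expr, frac = 0.2) {
  k <- ceiling(length(meth) * frac)
  ord <- order(meth)
  low  <- gene_expr[head(ord, k)]  # least methylated samples
  high <- gene_expr[tail(ord, k)]  # most methylated samples
  # Alternative hypothesis: expression is higher when methylation is lower
  wilcox.test(low, high, alternative = "greater")$p.value
}

# One p-value per candidate target gene
pvals <- apply(expr, 1, function(g) test_probe_gene(meth, g))
print(pvals)
```

In this toy setting geneA would be kept as a putative target of the probe, while geneB would only be kept if its p-value happened to pass the chosen threshold.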
Package workflow
The package workflow is shown in the figure below:
ELMER workflow: ELMER receives as input a DNA methylation object, a
gene expression object (both can be either a matrix or a
SummarizedExperiment object), and a GenomicRanges (GRanges) object with
the distal probes to be used as a filter, which can be retrieved using
the get.feature.probe function. The createMAE function creates a
MultiAssayExperiment (MAE) object that keeps only samples with both DNA
methylation and gene expression data. Genes are mapped to genomic
positions and annotated using the ENSEMBL database, while probes are
annotated using http://zwdzwd.github.io/InfiniumAnnotation. This MAE
object is the input to the subsequent analysis functions. First, ELMER
identifies differentially methylated probes and then their nearest genes
(10 upstream and 10 downstream) through the get.diff.meth and
GetNearGenes functions, respectively. For each probe, it tests whether
any of the nearby genes is affected by the change in the probe's DNA
methylation level, and the get.pair function outputs a list of
gene-probe pairs. For the probes in those pairs, it searches for
enriched transcription factor motifs with the get.enriched.motif
function. Finally, the enriched motifs are correlated with the
expression level of the transcription factors through the get.TFs
function. In the figure, green boxes represent user input data, blue
boxes represent output objects, orange boxes represent auxiliary
pre-computed data, and gray boxes are functions.
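The "10 upstream and 10 downstream" gene selection can be pictured with a small base R sketch. This is a simplified stand-in, not the GetNearGenes implementation: the helper `nearest_genes`, the made-up coordinates, the single chromosome, and the use of "upstream" to mean "lower coordinate" (ignoring strand) are all assumptions of the example.

```r
# Toy gene annotation on a single chromosome (made-up coordinates)
genes <- data.frame(
  symbol = paste0("gene", 1:30),
  tss    = seq(1000, by = 5000, length.out = 30)
)

# Hypothetical helper: the n closest genes on each side of a probe.
# "upstream" here simply means a lower coordinate; strand is ignored.
nearest_genes <- function(probe_pos, genes, n = 10) {
  up   <- genes[genes$tss <  probe_pos, , drop = FALSE]
  down <- genes[genes$tss >= probe_pos, , drop = FALSE]
  # Sort each side so that the closest gene comes first, then keep n
  up   <- head(up[order(up$tss, decreasing = TRUE), , drop = FALSE], n)
  down <- head(down[order(down$tss), , drop = FALSE], n)
  out <- rbind(up, down)
  out$side <- rep(c("upstream", "downstream"),
                  times = c(nrow(up), nrow(down)))
  out$distance <- out$tss - probe_pos  # negative for upstream genes
  rownames(out) <- NULL
  out
}

near <- nearest_genes(probe_pos = 77500, genes = genes)
print(near)
```

Each of the 20 candidate genes returned for a probe would then be tested for an association between its expression and the methylation level of that probe.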
Main differences between ELMER v2 and ELMER v1
Summary table
| Feature | ELMER v1 | ELMER v2 |
|---|---|---|
| Primary data structure | mee object (custom data structure) | MAE object (Bioconductor data structure) |
| Auxiliary data | Manually created | Programmatically created |
| Number of human TFs | 1,982 | 1,639 (curated list from Lambert, Samuel A., et al.) |
| Number of TF motifs | 91 | 771 (HOCOMOCO v11 database) |
| TF classification | 78 families | 82 families and 331 subfamilies (TFClass database, HOCOMOCO) |
| Analysis performed | Normal vs tumor samples | Group 1 vs group 2 |
| Statistical grouping | Unsupervised only | Unsupervised or supervised using labeled groups |
| TCGA data source | The Cancer Genome Atlas (TCGA) (no longer available) | The NCI’s Genomic Data Commons (GDC) |
| Reference genome | GRCh37 (hg19) | GRCh37 (hg19)/GRCh38 (hg38) |
| DNA methylation platforms | HM450 | EPIC and HM450 |
| Graphical User Interface (GUI) | None | TCGAbiolinksGUI |
| Automatic report | None | HTML summarizing results |
| Annotations | None | StateHub |
Supervised vs Unsupervised mode
In ELMER v2 we introduce a new concept, the algorithm mode, which can
be either supervised or unsupervised. In the unsupervised mode
(described in ELMER v1), it is assumed that one of the two groups is a
heterogeneous mix of different (sometimes unknown) molecular phenotypes.
For instance, in the example of breast cancer, normal breast tissues
(Group A) are relatively homogeneous, whereas breast tumors (Group B)
fall into multiple molecular subtypes.
The assumption of the Unsupervised mode is that methylation changes
may be restricted to a subset of one or more molecular subtypes, and
thus only be present in a fraction of the samples in the test group. For
instance, methylation changes related to estrogen signaling may only be
present in LuminalA or LuminalB subtypes.
When this structure is unknown, the Unsupervised mode is the
appropriate model, since it only requires changes in a subset of samples
(by default, 20%). In contrast, in the Supervised mode, it is assumed
that each group represents a more homogeneous molecular phenotype, and
thus we compare all samples in Group A vs. all samples in Group B. This
can be used for direct comparison of tumor subtypes
(e.g., Luminal vs. Basal-like tumors), but also in numerous other
situations, including sorted cells of different types, or treated
vs. untreated samples in perturbation experiments.
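The practical difference between the two modes can be seen in a toy base R simulation. The sketch below is an illustration under stated assumptions, not ELMER code: the helpers `select_samples` and `compare` are hypothetical, the data are simulated, and taking the least methylated fraction of each group corresponds to looking for hypomethylation in a subset of Group B.

```r
# Toy illustration only: simulated beta values, not ELMER's implementation.
set.seed(1)
group <- rep(c("A", "B"), each = 20)           # A: homogeneous, B: mixed
meth  <- c(rnorm(20, mean = 0.8, sd = 0.03),   # group A: methylated
           rnorm(16, mean = 0.8, sd = 0.03),   # group B: most samples unchanged
           rnorm(4,  mean = 0.3, sd = 0.03))   # group B: hypomethylated subtype

# Hypothetical helper: indices of the samples of each group that enter the
# comparison. frac = 1 mimics the supervised mode (all samples); frac = 0.2
# mimics the unsupervised default (least methylated 20% of each group).
select_samples <- function(meth, group, frac) {
  lapply(split(seq_along(meth), group), function(idx) {
    k <- ceiling(length(idx) * frac)
    idx[order(meth[idx])][seq_len(k)]
  })
}

# Number of samples used per group and the resulting mean difference (B - A)
compare <- function(frac) {
  s <- select_samples(meth, group, frac)
  c(n_A = length(s$A), n_B = length(s$B),
    mean_diff = mean(meth[s$B]) - mean(meth[s$A]))
}

unsup <- compare(0.2)
sup   <- compare(1)
print(rbind(unsupervised = unsup, supervised = sup))
```

Because only 4 of the 20 Group B samples carry the change, averaging over all samples dilutes the difference, whereas restricting the comparison to the extreme 20% of each group recovers it. When the groups are already homogeneous, using all samples is the more natural choice.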
Installing and loading ELMER
To install this package from GitHub (development version), start R
and enter:
devtools::install_github(repo = "tiagochst/ELMER.data")
devtools::install_github(repo = "tiagochst/ELMER")
To install this package from Bioconductor, start R and enter:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("ELMER")
Then, to load ELMER, enter:
library(ELMER)
Citing this work
If you use the ELMER package or its results, please cite:
- Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P.
“Inferring regulatory element landscapes and transcription factor
networks from cancer methylomes.” Genome Biol 16 (2015): 105.
- Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. “Demystifying
the secret mission of enhancers: linking distal regulatory elements to
target genes.” Critical reviews in biochemistry and molecular biology
50.6 (2015): 550-573.
- Tiago C Silva, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J
Hazelett, Houtan Noushmehr, De-Chen Lin, Benjamin P Berman; ELMER v.2:
An R/Bioconductor package to reconstruct gene regulatory networks from
DNA methylation and transcriptome profiles, Bioinformatics, bty902, https://doi.org/10.1093/bioinformatics/bty902
If you get TCGA data using the getTCGA function, please cite the
TCGAbiolinks package:
- Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D,
Sabedot T, Malta TM, Pagnotta SM, Castiglioni I, Ceccarelli M, Bontempi
G and Noushmehr H. “TCGAbiolinks: an R/Bioconductor package for
integrative analysis of TCGA data.” Nucleic acids research (2015):
gkv1507.
- Silva, TC, A Colaprico, C Olsen, F D’Angelo, G Bontempi, M
Ceccarelli, and H Noushmehr. 2016. “TCGA Workflow: Analyze Cancer
Genomics and Epigenomics Data Using Bioconductor Packages [Version 2;
Referees: 1 Approved, 1 Approved with Reservations].” F1000Research 5
(1542). doi:10.12688/f1000research.8923.2.
- Grossman, Robert L., et al. “Toward a shared vision for cancer
genomic data.” New England Journal of Medicine 375.12 (2016):
1109-1112.
If you use the graphical user interface, please cite the
TCGAbiolinksGUI package:
- Silva, Tiago C. and Colaprico, Antonio and Olsen, Catharina and
Bontempi, Gianluca and Ceccarelli, Michele and Berman, Benjamin P. and
Noushmehr, Houtan. “TCGAbiolinksGUI: A graphical user interface to
analyze cancer molecular and clinical data” (bioRxiv 147496; doi: https://doi.org/10.1101/147496)
Session Info
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MultiAssayExperiment_1.33.0 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 MatrixGenerics_1.19.0
## [5] matrixStats_1.4.1 GenomicRanges_1.59.0
## [7] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [9] S4Vectors_0.45.0 sesameData_1.23.0
## [11] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [13] BiocFileCache_2.15.0 dbplyr_2.5.0
## [15] BiocGenerics_0.53.0 BiocStyle_2.35.0
## [17] dplyr_1.1.4 DT_0.33
## [19] ELMER_2.31.0 ELMER.data_2.29.0
##
## loaded via a namespace (and not attached):
## [1] later_1.3.2 BiocIO_1.17.0
## [3] bitops_1.0-9 filelock_1.0.3
## [5] tibble_3.2.1 XML_3.99-0.17
## [7] rpart_4.1.23 lifecycle_1.0.4
## [9] httr2_1.0.5 rstatix_0.7.2
## [11] doParallel_1.0.17 vroom_1.6.5
## [13] processx_3.8.4 lattice_0.22-6
## [15] ensembldb_2.31.0 crosstalk_1.2.1
## [17] backports_1.5.0 magrittr_2.0.3
## [19] plotly_4.10.4 Hmisc_5.2-0
## [21] sass_0.4.9 rmarkdown_2.28
## [23] jquerylib_0.1.4 yaml_2.3.10
## [25] Gviz_1.51.0 chromote_0.3.1
## [27] DBI_1.2.3 RColorBrewer_1.1-3
## [29] abind_1.4-8 zlibbioc_1.53.0
## [31] rvest_1.0.4 purrr_1.0.2
## [33] AnnotationFilter_1.31.0 biovizBase_1.55.0
## [35] RCurl_1.98-1.16 nnet_7.3-19
## [37] VariantAnnotation_1.53.0 rappdirs_0.3.3
## [39] circlize_0.4.16 GenomeInfoDbData_1.2.13
## [41] ggrepel_0.9.6 codetools_0.2-20
## [43] DelayedArray_0.33.0 xml2_1.3.6
## [45] tidyselect_1.2.1 shape_1.4.6.1
## [47] farver_2.1.2 UCSC.utils_1.3.0
## [49] TCGAbiolinksGUI.data_1.25.0 base64enc_0.1-3
## [51] GenomicAlignments_1.43.0 jsonlite_1.8.9
## [53] GetoptLong_1.0.5 Formula_1.2-5
## [55] iterators_1.0.14 systemfonts_1.1.0
## [57] foreach_1.5.2 tools_4.5.0
## [59] progress_1.2.3 ragg_1.3.3
## [61] Rcpp_1.0.13 glue_1.8.0
## [63] BiocBaseUtils_1.9.0 gridExtra_2.3
## [65] SparseArray_1.7.0 xfun_0.48
## [67] websocket_1.4.2 withr_3.0.2
## [69] BiocManager_1.30.25 fastmap_1.2.0
## [71] latticeExtra_0.6-30 fansi_1.0.6
## [73] digest_0.6.37 mime_0.12
## [75] R6_2.5.1 textshaping_0.4.0
## [77] colorspace_2.1-1 jpeg_0.1-10
## [79] dichromat_2.0-0.1 biomaRt_2.63.0
## [81] RSQLite_2.3.7 utf8_1.2.4
## [83] tidyr_1.3.1 generics_0.1.3
## [85] data.table_1.16.2 rtracklayer_1.67.0
## [87] prettyunits_1.2.0 httr_1.4.7
## [89] htmlwidgets_1.6.4 S4Arrays_1.7.0
## [91] pkgconfig_2.0.3 gtable_0.3.6
## [93] blob_1.2.4 ComplexHeatmap_2.23.0
## [95] XVector_0.47.0 htmltools_0.5.8.1
## [97] carData_3.0-5 ProtGenerics_1.39.0
## [99] clue_0.3-65 scales_1.3.0
## [101] png_0.1-8 knitr_1.48
## [103] rstudioapi_0.17.1 reshape2_1.4.4
## [105] tzdb_0.4.0 rjson_0.2.23
## [107] checkmate_2.3.2 curl_5.2.3
## [109] cachem_1.1.0 GlobalOptions_0.1.2
## [111] stringr_1.5.1 BiocVersion_3.21.1
## [113] parallel_4.5.0 foreign_0.8-87
## [115] AnnotationDbi_1.69.0 restfulr_0.0.15
## [117] reshape_0.8.9 pillar_1.9.0
## [119] grid_4.5.0 vctrs_0.6.5
## [121] promises_1.3.0 ggpubr_0.6.0
## [123] car_3.1-3 cluster_2.1.6
## [125] archive_1.1.9 htmlTable_2.4.3
## [127] evaluate_1.0.1 TCGAbiolinks_2.35.0
## [129] readr_2.1.5 GenomicFeatures_1.59.0
## [131] cli_3.6.3 compiler_4.5.0
## [133] Rsamtools_2.23.0 rlang_1.1.4
## [135] crayon_1.5.3 ggsignif_0.6.4
## [137] labeling_0.4.3 interp_1.1-6
## [139] ps_1.8.1 plyr_1.8.9
## [141] stringi_1.8.4 viridisLite_0.4.2
## [143] deldir_2.0-4 BiocParallel_1.41.0
## [145] munsell_0.5.1 Biostrings_2.75.0
## [147] lazyeval_0.2.2 Matrix_1.7-1
## [149] BSgenome_1.75.0 hms_1.1.3
## [151] bit64_4.5.2 ggplot2_3.5.1
## [153] KEGGREST_1.47.0 highr_0.11
## [155] fontawesome_0.5.2 broom_1.0.7
## [157] memoise_2.0.1 bslib_0.8.0
## [159] bit_4.5.0 downloader_0.4