Detection of m7G, m3C and D modifications by AlkAnilineSeq
7-methylguanosine (m7G), 3-methylcytidine (m3C) and dihydrouridine (D) are commonly found in rRNA and tRNA and can be detected classically by primer extension analysis. However, since these modifications do not interfere with Watson-Crick base pairing, a chemical treatment is needed to cause strand breaks specifically at the modified positions. Initially, this involved treatment with sodium borohydride to create abasic sites, followed by cleavage of the RNA at these sites with aniline.
This classical protocol was converted into a high-throughput sequencing method called AlkAnilineSeq, which allows modified positions to be detected by an accumulation of 5’-ends at the N+1 position (Marchand et al. 2018). It was found that m3C is also susceptible to this treatment, so m7G, m3C and D can be detected by the same method from the same data sets: the identity of the unmodified nucleotide at a detected position indicates which of the three modifications is present.
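To make the detection principle concrete, here is a minimal sketch in base R using made-up counts. It assumes a simplified score, the fraction of reads covering a position whose 5’-end maps there. This is an illustration only and does not reproduce how the package computes scoreNC or scoreSR. The names end5, coverage and stop_ratio are hypothetical.

```r
# Hypothetical per-position counts for a 10-nt window (not real data)
end5     <- c(2, 1, 3, 2, 40, 1, 2, 3, 1, 2)           # reads whose 5'-end maps to each position
coverage <- c(50, 50, 52, 51, 50, 48, 47, 49, 50, 50)  # reads covering each position

# Simplified stop ratio (an assumption for illustration):
# share of covering reads that start at the position
stop_ratio <- end5 / coverage

# Cleavage leaves the new 5'-end at N+1, so the candidate modified
# nucleotide N sits one position upstream of the peak
peak     <- which.max(stop_ratio)
modified <- peak - 1L

peak
modified
```

In this toy window the 5’-ends pile up at position 5, which points to position 4 as the candidate modified nucleotide.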
The ModAlkAnilineSeq class uses the NormEnd5SequenceData class to store and
aggregate data along the transcripts. The calculated scores follow the
nomenclature of Marchand et al. (2018), with the names scoreNC (default) and
scoreSR.
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.AlkAnilineSeq)
library(RNAmodR.Data)
The example workflow is limited to 18S rRNA and some tRNAs from S. cerevisiae.
Annotation data can be supplied as either a GFF file or a TxDb object, and
sequence data as either a FASTA file or a BSgenome object. The sequencing data
is provided as BAM files.
annotation <- GFF3File(RNAmodR.Data.example.AAS.gff3())
sequences <- RNAmodR.Data.example.AAS.fasta()
# One named vector of BAM files per sample; the repeated name "treated"
# marks the replicates of that sample
files <- list("wt" = c(treated = RNAmodR.Data.example.wt.1(),
                       treated = RNAmodR.Data.example.wt.2(),
                       treated = RNAmodR.Data.example.wt.3()),
              "Bud23del" = c(treated = RNAmodR.Data.example.bud23.1(),
                             treated = RNAmodR.Data.example.bud23.2()),
              "Trm8del" = c(treated = RNAmodR.Data.example.trm8.1(),
                            treated = RNAmodR.Data.example.trm8.2()))
The analysis is triggered by the construction of a ModSetAlkAnilineSeq object.
Internally, parallelization is used via the BiocParallel package, which allows
the run to be tuned to the number and size of the input files (number of
samples, number of replicates, number of transcripts, etc.).
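As a sketch of how the parallelization could be tuned, a BiocParallel backend can be registered before the ModSetAlkAnilineSeq object is constructed. This assumes that the analysis picks up the registered default backend, which is not confirmed here. The worker count in the commented line is an arbitrary example.

```r
library(BiocParallel)

# Run serially, e.g. for debugging or on machines with little memory
register(SerialParam())

# Alternatively, use two worker processes
# (forked workers are not supported on Windows)
# register(MulticoreParam(workers = 2L))

# Inspect the backend that is currently the registered default
bpparam()
```

Which backend works best depends on the number of BAM files and the available memory, so it is worth trying more than one setting.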
msaas <- ModSetAlkAnilineSeq(files, annotation = annotation, sequences = sequences)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
msaas
## ModSetAlkAnilineSeq of length 3
## names(3): wt Bud23del Trm8del
## | Modification type(s): m7G / m3C / D
##                        wt       Bud23del Trm8del
## | Modifications found: yes (9)  yes (8)  yes (7)
## | Settings:
##           minCoverage minReplicate  find.mod minLength minSignal minScoreNC
##             <integer>    <integer> <logical> <integer> <integer>  <integer>
## wt                 10            1      TRUE         9        10         50
## Bud23del           10            1      TRUE         9        10         50
## Trm8del            10            1      TRUE         9        10         50
##          minScoreSR minScoreBaseScore scoreOperator
##           <numeric>         <numeric>   <character>
## wt              0.5               0.9             &
## Bud23del        0.5               0.9             &
## Trm8del         0.5               0.9             &
As expected, m7G1575 is missing from the Bud23del samples.
mod <- modifications(msaas)
lapply(mod, head, n = 2L)
## $wt
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr1      1575      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr3        46      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]   162.228  0.984209           1
##   [2]   373.773  0.841166           3
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
##
## $Bud23del
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr3        46      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr5        50      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]  254.6403  0.858101           3
##   [2]   86.3556  0.605249           5
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
##
## $Trm8del
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr1      1575      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr3        37      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]  117.2479   0.98729           1
##   [2]   69.9604   0.97953           3
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
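The absence of m7G1575 from the Bud23del sample can also be checked programmatically. The following is a minimal sketch on a toy stand-in for the result of modifications(), rebuilt from the first two rows printed for each sample. The names toy_mod and has_m7G1575 are hypothetical, and the real object holds more ranges.

```r
library(GenomicRanges)

# Toy stand-in for `modifications(msaas)`: only the two ranges per sample
# that are printed in the output above (assumption: a plain named list)
toy_mod <- list(
  wt       = GRanges(c("chr1", "chr3"), IRanges(c(1575L, 46L), width = 1L),
                     strand = "+", mod = c("m7G", "m7G")),
  Bud23del = GRanges(c("chr3", "chr5"), IRanges(c(46L, 50L), width = 1L),
                     strand = "+", mod = c("m7G", "m7G")),
  Trm8del  = GRanges(c("chr1", "chr3"), IRanges(c(1575L, 37L), width = 1L),
                     strand = "+", mod = c("m7G", "m7G"))
)

# For each sample, test whether a modification was called at
# position 1575 of the sequence named "chr1"
has_m7G1575 <- vapply(toy_mod, function(gr) {
  any(as.character(seqnames(gr)) == "chr1" & start(gr) == 1575L)
}, logical(1))

has_m7G1575
```

The same function could be applied to the full result stored in mod to check all called positions rather than only the first two per sample.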
As outlined in the RNAmodR package, the samples can be compared with
plotCompareByCoord, which draws a heatmap. For this, we select positions from
the modifications found in the first sample (wt). In addition, we prepare an
alias table that maps transcript IDs to readable names.
coord <- mod[[1L]]
alias <- data.frame(tx_id = c(1L,3L,5L,6L,7L,8L,10L,11L),
                    name = c("18S rRNA","tF(GAA)B","tG(GCC)B","tT(AGT)B",
                             "tQ(TTG)B","tC(GCA)B","tS(CGA)C","tV(AAC)E1"),
                    stringsAsFactors = FALSE)
plotCompareByCoord(msaas, coord, score = "scoreSR", alias = alias,
                   normalize = TRUE)
plotCompareByCoord(msaas, coord[1L], score = "scoreSR", alias = alias)
In addition, the aggregated data along the transcript can be visualized.
plotData(msaas, "1", from = 1550L, to = 1600L)
The raw sequence data can be included as well.
plotData(msaas[1L:2L], "1", from = 1550L, to = 1600L, showSequenceData = TRUE)
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Rsamtools_2.23.0 RNAmodR.Data_1.19.0
## [3] ExperimentHubData_1.33.0 AnnotationHubData_1.37.0
## [5] futile.logger_1.4.3 ExperimentHub_2.15.0
## [7] AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [9] dbplyr_2.5.0 RNAmodR.AlkAnilineSeq_1.21.0
## [11] RNAmodR_1.21.0 Modstrings_1.23.0
## [13] Biostrings_2.75.0 XVector_0.47.0
## [15] rtracklayer_1.67.0 GenomicRanges_1.59.0
## [17] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [19] S4Vectors_0.45.0 BiocGenerics_0.53.0
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.17.0 bitops_1.0-9
## [3] filelock_1.0.3 tibble_3.2.1
## [5] graph_1.85.0 XML_3.99-0.17
## [7] rpart_4.1.23 lifecycle_1.0.4
## [9] httr2_1.0.5 lattice_0.22-6
## [11] ensembldb_2.31.0 OrganismDbi_1.49.0
## [13] backports_1.5.0 magrittr_2.0.3
## [15] Hmisc_5.2-0 sass_0.4.9
## [17] rmarkdown_2.28 jquerylib_0.1.4
## [19] yaml_2.3.10 RUnit_0.4.33
## [21] Gviz_1.51.0 DBI_1.2.3
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] zlibbioc_1.53.0 purrr_1.0.2
## [27] AnnotationFilter_1.31.0 biovizBase_1.55.0
## [29] RCurl_1.98-1.16 nnet_7.3-19
## [31] VariantAnnotation_1.53.0 rappdirs_0.3.3
## [33] GenomeInfoDbData_1.2.13 AnnotationForge_1.49.0
## [35] codetools_0.2-20 DelayedArray_0.33.0
## [37] xml2_1.3.6 tidyselect_1.2.1
## [39] farver_2.1.2 UCSC.utils_1.3.0
## [41] matrixStats_1.4.1 base64enc_0.1-3
## [43] GenomicAlignments_1.43.0 jsonlite_1.8.9
## [45] Formula_1.2-5 tools_4.5.0
## [47] progress_1.2.3 stringdist_0.9.12
## [49] Rcpp_1.0.13 glue_1.8.0
## [51] gridExtra_2.3 SparseArray_1.7.0
## [53] BiocBaseUtils_1.9.0 xfun_0.48
## [55] MatrixGenerics_1.19.0 dplyr_1.1.4
## [57] withr_3.0.2 formatR_1.14
## [59] BiocManager_1.30.25 fastmap_1.2.0
## [61] latticeExtra_0.6-30 fansi_1.0.6
## [63] digest_0.6.37 mime_0.12
## [65] R6_2.5.1 colorspace_2.1-1
## [67] jpeg_0.1-10 dichromat_2.0-0.1
## [69] biomaRt_2.63.0 RSQLite_2.3.7
## [71] utf8_1.2.4 generics_0.1.3
## [73] data.table_1.16.2 prettyunits_1.2.0
## [75] httr_1.4.7 htmlwidgets_1.6.4
## [77] S4Arrays_1.7.0 pkgconfig_2.0.3
## [79] gtable_0.3.6 blob_1.2.4
## [81] htmltools_0.5.8.1 bookdown_0.41
## [83] RBGL_1.83.0 ProtGenerics_1.39.0
## [85] scales_1.3.0 Biobase_2.67.0
## [87] png_0.1-8 colorRamps_2.3.4
## [89] knitr_1.48 lambda.r_1.2.4
## [91] rstudioapi_0.17.1 reshape2_1.4.4
## [93] rjson_0.2.23 checkmate_2.3.2
## [95] curl_5.2.3 biocViews_1.75.0
## [97] cachem_1.1.0 stringr_1.5.1
## [99] BiocVersion_3.21.1 parallel_4.5.0
## [101] foreign_0.8-87 AnnotationDbi_1.69.0
## [103] restfulr_0.0.15 pillar_1.9.0
## [105] grid_4.5.0 vctrs_0.6.5
## [107] cluster_2.1.6 htmlTable_2.4.3
## [109] evaluate_1.0.1 magick_2.8.5
## [111] tinytex_0.53 GenomicFeatures_1.59.0
## [113] cli_3.6.3 compiler_4.5.0
## [115] futile.options_1.0.1 rlang_1.1.4
## [117] crayon_1.5.3 labeling_0.4.3
## [119] interp_1.1-6 plyr_1.8.9
## [121] stringi_1.8.4 deldir_2.0-4
## [123] BiocParallel_1.41.0 BiocCheck_1.43.0
## [125] txdbmaker_1.3.0 munsell_0.5.1
## [127] lazyeval_0.2.2 Matrix_1.7-1
## [129] BSgenome_1.75.0 hms_1.1.3
## [131] bit64_4.5.2 ggplot2_3.5.1
## [133] KEGGREST_1.47.0 highr_0.11
## [135] SummarizedExperiment_1.37.0 ROCR_1.0-11
## [137] memoise_2.0.1 bslib_0.8.0
## [139] bit_4.5.0