Genomic analysis can be utilised to identify differences between RNA populations in two conditions, both in production and abundance. This includes the identification of RNAs produced by multiple genomes within a biological system. For example, RNA produced by pathogens within a host. Or for our purposes, RNAs moving between the roots and shoots across a plant graft junction of partners with distinct genotypes. To locate and identify these RNAs genomic pipelines typically align reads consequential to each genome in the system. This comes with benefits and disadvantages. Here, we address the main disadvantages, the high levels of data noise and false positives. The mobileRNA package provides methods to pre-process, analyse and visualise the sRNA and mRNA populations based on the premise of mapping reads to all genotypes at the same time. This vignette explains the use of the package and demonstrates quick or advanced workflows.
In plants, systemic signalling is an elaborated molecular system which coordinates plant development, integrating and transmitting the information perceived from the environment to distant organs. An important role in long-distance signalling is played by small RNA molecules (sRNAs). The nucleotide length of a sRNA helps researchers identify the class of sRNA and predict its functionality. Micro-RNAs (miRNAs) are involved in directing translational repression and/or the cleavage of messenger RNAs (mRNAs). Whereas small interfering RNAs (siRNAs) are involved in the maintenance and de novo DNA methylation and account for the majority of sRNAs in plants. These endogenous sRNAs can be produced in a tissue and then transported systemically across the vascular system into recipient organs, where they can induce a molecular response and coordinate physiological changes. Similarly, messenger RNAs (mRNAs) can move across distances, and it’s thought they may translate into proteins which act as transcription factors in the recipient tissues.
Plant grafting can be utilised to create chimeric plant systems composed of two
genotypes, such as different species like tomato and eggplant, or plant
varieties or accessions. Grafting has been used as a method to study RNA
mobilomes and their impact on the phenotype. Yet, it is clear that there is no
standardised genomic approach for the analysis of sequencing data to identify an
RNA mobilome. Here we introduce the R package, mobileRNA, a recommended pipeline
and analysis workflow for the identification of a sRNAs/mRNA mobilome. In
addition, the flexibility supports standard RNA analysis between treatment and
control conditions. For example, to identify sRNA population changes due to the
application of a treatment such as cold/heat stress or exposure to a pest.
mobileRNA
ultimately assists in pre-processing and analysis including the
characterization of different populations, visualization of the results, and
supporting output for functional analysis.
As stated, this was developed for applications for plant grafting experimental analysis, however, we believe it could have further applications including the analysis of dual-host systems.
In grafted plants, when different genotypes are used as rootstock and scions, the sequence variation between the two genomes involved can be used to discriminate the origin of a sequenced RNA molecule. Therefore, if an RNA molecule sequenced from one of the grafted partners (scion or rootstock) has been found to match the genome of the other grafting partner, this could empirically demonstrate its movement across the graft junction.
Most available genomics approaches to implement this analysis are based on RNA sequencing, followed by alignment on a genotype of reference and post-alignment screening of genetic variants to identify molecules which have a better match for the genotype of the grafted partner. These methods have many limitations, which might include:
Here, to circumvent such problems we propose a method inspired by the RNAseq analysis of plant hybrids (Lopez-Gomollon 2022), including an alignment step performed simultaneously on both genomes involved. The rationale of this approach considers that alignment tools already implement an algorithm ideal for the identification of the best matches (according to set parameters) in a given genome reference, but they do not account for potential matches to DNA sequences which are not provided as reference. Therefore, the two genomes from all partners involved in the system are merged in a single FASTA file and used as a reference for the unique alignment. Ultimately, in a bid to supply the algorithm with as much information as possible to make the best possible predictions and placement of sequencing reads to each genome.
The summarised workflow is shown below (Figure 1) where it contains a core RNA analysis and a mobile sRNA/mRNA analysis. The core analysis represents the standard workflow for the identification of RNA populations which have been gained, lost or changed in abundance, for example, the sRNA population difference between treatment and condition samples, or similarly in a chimeric system, such as plant grafting, we might want to explore the native sRNA population from the sample tissue origin (i.e. leaf) which have been lost or gained or changes in sRNA abundance. While the mobile analysis represents the workflow for the identification of putative mobile sRNAs or mRNA in a plant graft system.
As input, the pipeline requires cleaned sRNA or mRNA sequencing reads in FASTQ
format, along with the genome assemblies which represent the genotypes in the
system. The diagram below illustrates the complete workflow using mobileRNA
,
including essential, optional, and plotting functions.
The analysis approach includes several underlying features to be aware of which can alter the final output. When working with a chimeric system, the core steps offer the ability to remove mapping errors by comparing control samples to treatment samples. If the genotypes in the chimeric system are fairly distantly related, it is unexpected that unique reads aligned to the mobile genotype will be found in the control samples. With that in mind, we can assume RNAs with reads mapped to the mobile genotype from the control samples could be artifacts or mapping errors. Hence, these RNAs are removed from the analysis when the parameters are utilized. Note that, if the chimeric system is expected to share some or a high level of similarity it might be insightful to the analysis to not remove these sRNA clusters.
At the end of the analysis, the user will have generated a dataframe where rows represent either sRNA clusters or mRNAs and the columns include information on the sRNA clusters or mRNAs, individual sample replicates, and more:
Locus
: Genomic locationchr
: Chromosome or scaffold namestart
: Start position of the clusterend
: End position of the clusterCluster
: Cluster NameDicerConsensus
: Consensus dicercall (Calculated by RNAdicercall()
)DicerCounts
: Number of replicates which contributed to the consensus dicercall (Calculated by RNAdicercall()
)CountMean
: Count mean (Calculated by RNAdifferentialAnalysis()
)log2FoldChange
: Log fold change (Calculated by RNAdifferentialAnalysis()
)pvalue
: p-value (Calculated by RNAdifferentialAnalysis()
)padjusted
: Adjusted p-value (Calculated by RNAdifferentialAnalysis()
)logCPM
: Log counts per million (CPM/RPM) (Calculated by RNAdifferentialAnalysis()
)DicerCall_
: The nucleotide length of most abundant sRNACount_
: Number of aligned sRNA reads. As default, these are uniquely aligned (e.g. not multi-mapping).RPM_
: Reads per MillionFPKM_
: Fragments Per Kilobase of transcript per MillionMajorRNA_
: RNA sequence of most abundant sRNA in the clusterFeature
: mRNA nameSampleCounts
: Consensus dicercall (Calculated by RNAdicercall()
)CountMean
: Count mean (Calculated by RNAdifferentialAnalysis()
)log2FoldChange
: Log fold change (Calculated by RNAdifferentialAnalysis()
)pvalue
: p-value (Calculated by RNAdifferentialAnalysis()
)padjusted
: Adjusted p-value (Calculated by RNAdifferentialAnalysis()
)logCPM
: Log counts per million (CPM/RPM) (Calculated by RNAdifferentialAnalysis()
)Count_
: Number of aligned mRNA reads. As default, these are uniquely aligned (e.g. not multi-mapping).FPKM_
: Fragments Per Kilobase of transcript per MillionIn addition, the user can utilise additional functions to plot a PCA, distance matrix, sRNA class distribution and heatmaps. Plus, mobileRNA offers functions to assist functional analysis to support the determination of the biological implications. For instance, within sRNA analysis, the user can extract the consensus RNA sequence for each sRNA cluster for target prediction analysis, as well as identifying genomic features associates with each sRNA clusters, such as genes or repeat regions. This can be utilised, for example, to explore the RNA expression of sRNA-producing genes in parallel analysis.
The latest version of mobileRNA
can be installed via Bioconductor:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mobileRNA")
It is also available on GitHub:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("KJeynesCupper/mobileRNA", ref = "main")
Load into R library:
mobileRNA
works on systems with R
, and depending on the type of sequencing
data different OS dependencies installed via Conda (Anaconda-Inc 2020) are required for
the alignment step.
For sRNA data, ShortStack
(>= 4.0) (Axtell 2013) and the dependencies are
required. Please consider that ShortStack
is not available for Windows, hence,
Windows users will either need to opt to use a virtual machine or
Windows Subsystem for Linux
In either case, both R
and ShortStack
will need to be installed and used on
the Linux side. Please head to ShortStack
to see the recommended installation instructions with Conda (Anaconda-Inc 2020).
This will ensure all dependencies are available within the same environment.
For mRNA data, HISAT2 (Kim 2015), HTSeq (Anders, Pyl, and Huber 2014), SAMtools (Danecek P 2021) are required within the same Conda environment (Anaconda-Inc 2020).
The package includes a simulated data set to replicate the grafting between eggplant-tomato (Solanum melongena - Solanum lycopersicon) where eggplant represents the scion and tomato represents the rootstock. The FASTQ files represent data extracted from the eggplant leaf tissue. Here we will locate mobile RNAs produced by tomato roots which have travelled into the eggplant leaves. In the example data set, there are heterograft replicate, where each is an individual tomato replicate spiked with the same random set of tomato sRNA clusters. There are two sets of each for mRNA analysis and three sets for sRNA analysis. These are known as:
heterograft_1
heterograft_2
heterograft_3
(sRNA analysis only)There are self-graft replicates, where each is an individual tomato replicate without the spiked tomato sRNA clusters. These are:
selfgraft_1
selfgraft_2
selfgraft_3
(sRNA analysis only)The replicates mirror each other where, for instance, heterograft_1
and selfgraft_1
are the same replicate, either with or without the spiked
clusters.
Hence, we also provide more comprehensive data sets, for small RNA it is called
sRNA_data
and for messenger RNA it is called mRNA_data
. Each stores an
example dataframe produced by the importation step with RNAimport()
. This can
be loaded in the R environment by using the following command:
For sRNAseq & mRNAseq alignment step, we provide demo FASTQ files; these have a
reduced number of reads and do not cluster as expected, so should not be used
for downstream analysis. These are stored within inst/extdata
. Tomato and
Eggplant reduced genome assemblies and annotations have been provided
based on those generated by Hosmani et al., 2019 (Hosmani 2019) and Barchi
et al., 2021 (Barchi 2021).
The simulated data was generated from the data published Consortium (2012).
This is a quick-start example analysis of sRNAseq data to locate putative mobile
root-to-shoot sRNAs from a plant grafting experiment between eggplant and
tomato.
Merge the FASTA genome assemblies of tomato and eggplant into a single reference file stored in your desired directory. Eggplant represents the scion as genome A, and tomato represents the rootstock (foreign/mobile) as genome B.
fasta_1 <- system.file("extdata","reduced_chr12_Eggplant.fa.gz",
package="mobileRNA")
fasta_2 <-system.file("extdata","reduced_chr2_Tomato.fa.gz",
package="mobileRNA")
# define temporary output directory - replace with your directory
output_assembly_file <- file.path(tempfile("merged_assembly",
fileext = ".fa"))
# merge
merged_reference <- RNAmergeGenomes(genomeA = fasta_1,
genomeB = fasta_2,
output_file = output_assembly_file)
Now, repeat for the genome annotations (GFF) - this is required for downstream analysis or for mRNA mapping.
It is important that the same alterations are made to each genome in the merged
files. Otherwise, the annotation file will not align with the chromosome names
in the assembly file.
anno1 <- system.file("extdata","reduced_chr12_Eggplant.gff.gz",
package="mobileRNA")
anno2 <- system.file("extdata","reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
# define temporary output directory
output_annotation_file <- file.path(tempfile("merged_annotation",
fileext = ".gff3"))
# merge annotation files into a single file
merged_annotation <- RNAmergeAnnotations(annotationA = anno1,
annotationB = anno2,
output_file = output_annotation_file)
Here we will align our samples containing small RNA sequencing reads to the
merged genome assembly. The mapRNA()
function invokes a system
command to ShortStack
; an application which employs Bowtie
(Langmead B 2009)
mapping and performs comprehensive de novo annotation and quantification of
sRNA genes. The function first undertakes de novo sRNA cluster detection
(stored in 1_de_novo_detection
) and then quantification of sRNA genes
(stored in2_sRNA_results
). Since we are using chimeric samples, we
use mmap = n
to exclude multi-mapped reads. We exclude multi-mappers as we
currently do not have a method to distinguish whether a read is mapped to
multiple locations within one or both of the genomes in the merged reference.
PLEASE NOTE: For the alignment & import demo, we are using fastq files which have a reduced size for the package. This data will not be used for the continued analysis step, instead, you will need to load the full dataset.
# directory containing only sRNAseq samples
samples <- system.file("extdata/sRNAseq",package="mobileRNA")
# output location
output_location <- tempdir()
mapRNA(input = "sRNA",
input_files_dir = samples,
output_dir = output_location,
genomefile = output_assembly_file,
condaenv = "/Users/user-name/miniconda3/envs/ShortStack4",
mmap = "n")
If there are issues utilising this function, the manual steps are illustrated in
the Appendix
Import the results from the alignment step into R using the RNAimport()
function.
# Directory containing results
results_dir <- file.path(output_location,"2_sRNA_results")
# Sample names and total number of reads, in the same order.
sample_names <- c("selfgraft_demo_1", "selfgraft_demo_2",
"selfgraft_demo_3", "heterograft_demo_1",
"heterograft_demo_2", "heterograft_demo_3")
sRNA_data_demo <- RNAimport(input = "sRNA",
directory = results_dir,
samples = sample_names)
This will generate a dataframe where rows represent sRNA clusters and contains the following columns:
Locus
: Genomic locationchr
: Name of the chromosome or scaffoldstart
: Start position of the clusterend
: End position of the clusterCluster
: Cluster NameFor each sample, the following columns will be present. Where the sample name follows after the underscore:
DicerCall_
: The size of most abundant small RNA sizeCount_
: Number of uniquely aligned reads that overlap the locus.RPM_
: Reads per MillionFPKM_
: Fragments Per Kilobase of transcript per MillionMajorRNA_
: RNA sequence of the most abundant sRNA in the clusterLoad the full analysis dataset for a more comprehensive analysis:
Select the putative mobile sRNA clusters using RNAmobile()
. This
requires supplying the function with a unique identifier of the
rootstock genome, which is the prefix “B” to the tomato chromosome names. Here,
the function retains sRNA clusters which were aligned to the tomato genome.
# define control samples
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_data,
controls = controls,
genome.ID = "B",
task = "keep")
For a more advanced analysis, users can include further steps. Here we demonstrate a more advanced analysis for the Core RNA analysis and Mobile RNA analysis, both demonstrated with the sRNAseq dataset.
A useful step before analysis is to assess the overall similarity between
sample replicates to understand which samples are similar and different. This
is known as sample-level quality control and can help us understand where the
largest variation is introduced, whether the data meets the expectations and if
there are outliers.
Here, the RNAdistribution()
function can generate a number of different
customized plots to represent the number of sRNA clusters within each
dicercall sRNA class in a sample. In plants, sRNAs are known to be produced
with the length between 20-24 nucleotides, and the lengths signify the sRNA
class and specific functional role in (epi)genetic regulation. If there was an
inconsistent size profile of sRNAs within the cluster, the dicercall is defined
as “N”, ie, unclassified.
Plotting the distribution of dicercall sRNA classes within each replicate can support expectation for samples. For instance, if a significant number of sRNA clusters are unclassified it might suggest the data contains degraded RNA fragments, or novel types of sRNA genes.
# plot each replicate as a line, separately, and facet
sample_distribution_line <- RNAdistribution(sRNA_data,
style = "line",
overlap = FALSE,
facet = TRUE,
colour = "darkgreen")
# plot each replicate as a bar, separately, and facet
sample_distribution_bar <- RNAdistribution(sRNA_data,
style = "bar",
facet = TRUE,
colour ="lightblue")
Let’s view the plots:
Principal Component Analysis (PCA) is a useful technique to illustrate sample
distance as it emphasizes the variation through the reduction of dimensions in
the data set. Here, we introduce the function plotSamplePCA()
groups <- c("Heterograft", "Heterograft", "Heterograft",
"Selfgraft", "Selfgraft", "Selfgraft")
plotSamplePCA(data = sRNA_data, group = groups, size.ratio = 2)
Have a look at the sRNA_data
data frame, you will see that for each sample the
sRNA class for a given cluster has been determined (see columns with names
containing with “DicerCall_”) which, in this data, will state a number from
20-24. This value represent the length in nucleotides of the most abundant
sRNA within the cluster. For some clusters, there is no particular sRNA which
is more abundant than another, hence, it is stated as “NA” or “N”, which is
referred to as being unclassified.
The RNAdicercall()
function is used to calculate the consensus dicercall
for each sRNA cluster. This is based on the classification predicted for the
cluster by each sample within the analysis. There are several parameters which
will alter the output, including the handling of ties and the
method to draw the consensus from.
When working along the mobile sRNA analysis workflow, the function contains
a specialised parameter which can be utilized. This is chimeric=TRUE
, plus
with the genome.ID
and controls
parameters. In the example below B_
represents the prefix added to the mobile/foreign genotype, which is
the rootstock tomato genome. This helps optimise classification by removal of
potential mapping errors.
# define consensus, store as a data summary file.
sRNA_data_dicercall <- RNAdicercall(data = sRNA_data,
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
For the downstream analysis, it can be useful to define distinct groups of
sRNA classes depending on your organism. For plant samples, it is beneficial to
select a group of 24-nt and another containing 21/22-nt sRNAs.
To subset the data, use the RNAsubset()
function to choose which sRNA
populations.
# Subset data for analysis: 24-nt sRNAs
sRNA_24 <- RNAsubset(sRNA_data_dicercall, class = 24)
# Subset data for analysis: 21/22-nt sRNAs
sRNA_2122 <- RNAsubset(sRNA_data_dicercall, class = c(21, 22))
DESeq2
or edgeR
Differential analysis is undertaken to identify RNAs which
are statistically significant to discover quantitative changes in the abundance
levels between the treatment and the control groups. This technique can be
undertaken with either the DESeq2
(Love, Huber, and Anders 2014) or edgeR
(McCarthy, Chen, and Smyth 2012)
analytical method.
First, let’s re-order the data frame so we can compare control vs treatment (i.e. Selfgraft vs Heterograft). When the data is imported, it may not be in the correct order/levels for comparison.
#reorder df
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
reorder_df <- RNAreorder(sRNA_data_dicercall, controls)
# sample conditions in order within dataframe
groups <- c("Selfgraft", "Selfgraft", "Selfgraft",
"Heterograft", "Heterograft", "Heterograft")
## Differential analysis of whole dataset: DESeq2 method
sRNA_DESeq2 <- RNAdifferentialAnalysis(data = reorder_df,
group = groups,
method = "DESeq2")
We can summarise the results using RNAsummary()
:
Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.1 144 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 144 , 100 % LFC < 0 (lower) : 0 , 0 %
How about looking at the sRNA population which are statistically significant:
Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.05 141 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 141 , 100 % LFC < 0 (lower) : 0 , 0 %
When comparing treatment to control conditions, it might be the case that the same sRNA clusters are found within both, yet, there could be difference in the total abundance of the shared clusters. For instance, for a given sRNA cluster the samples in the treatment condition might have a greater abundance than the samples in the control condition.
The statistical analysis calculated the log2FC values for each sRNA cluster by comparing the normalised counts between treatment and control. Here, a positive log2FC indicates an increased abundance of transcripts for a given sRNA cluster in the treatment compared to the control, while negative log2FC indicates decreased abundance of transcripts for a given sRNA cluster. The statistical significance of the log2FC is determined by the adjusted p-value.
Here we will filter the data to select sRNA clusters which are statistically significant, and then plot the results as a heatmap to compare the conditions.
NOTE: the data used here will not yield any results as the treatment and control samples contain the exact same population of eggplant sRNAs, the only difference in the treatment samples are the spiked tomato sRNA clusters.
# summary of statistical sRNA clusters
RNAsummary(sRNA_DESeq2, alpha=0.05)
#> Comparison : Selfgraft Vs Heterograft
#> out of 8138 with nonzero read count
#> adjusted p-value < 0.05
#> 141 sRNA clusters met the adjusted p-value cutoff
#> LFC > 0 (higher) : 141 , 100 %
#> LFC < 0 (lower) : 0 , 0 %
#>
# select significant
significant_sRNAs <- sRNA_DESeq2[sRNA_DESeq2$padjusted < 0.05, ]
Plot the results:
p1 <- plotHeatmap(significant_sRNAs, value = "RPM", row.names = FALSE,
title = "Heatmap of log-transformed FPKM")
The RNApopulation()
function can be utilised to identify unique
sRNA populations found in the treatment or control conditions. This represents
the sRNA which are gained or lost due to the treatment conditions.
First let’s look at the sRNA clusters gained to the treatment condition.
In the chimeric heterografts, we expect that the foreign sRNAs will also be
selected in this pick-up, therefore, we can use the parameter genome.ID
to
remove sRNA cluster related to the foreign genome.
# select sRNA clusters only found in treatment & not in the control samples
gained_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("heterograft_1",
"heterograft_2" ,
"heterograft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
#> Comparison : Selfgraft Vs Heterograft
#> out of 8138 sRNA clusters in the dataset
#> unique sRNA clusters : 148 , 1.8 %
#> samples : heterograft_1, heterograft_2, heterograft_3
#> --- No statistical cutoff was implemented.
#>
#> LFC > 0 (higher) : 148 , 100 % LFC < 0 (lower) : 0 , 0 % NULL
# look at number of sRNA cluster only found in treatment
nrow(gained_sRNA)
#> [1] 148
Now, the sRNA clusters lost and only produced in the control condition:
# select sRNA clusters only found in control & not in the treatment samples
lost_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("selfgraft_1",
"selfgraft_2" ,
"selfgraft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
#> Comparison :
#> out of 8138 sRNA clusters in the dataset
#> unique sRNA clusters : 0 , 0 %
#> samples : selfgraft_1, selfgraft_2, selfgraft_3
#> --- No statistical cutoff was implemented.
#>
#> LFC > 0 (higher) : 0 , NaN % LFC < 0 (lower) : 0 , NaN % NULL
# look at number of sRNA cluster only found in control
nrow(lost_sRNA)
#> [1] 0
Now we have identified unique populations produced or not produced in our treatment samples compared to our control samples, we can extract the RNA sequences to undertaken target prediction and then onward to gene ontology enrichment analysis.
Moreover, we can identify genomic features associated with these sRNA clusters which are unique to the treatment and absent in the control (i.e. gained).
gained_sRNA_attributes <- RNAattributes(data = gained_sRNA,
match ="genes",
annotation = output_annotation_file)
We identify candidate mobile RNAs by identifying those which are mapped
to the genotype representing the mobile RNAs. These can be isolated using the
RNAmobile()
function. Then the user can undertake further steps to assist
functional analysis.
In respect to the example data set, we are looking to identify sRNAs traveling
from the tomato rootstock to the eggplant scion in the heterografts. Hence, this
function will look to select clusters mapped to the tomato genome and remove
those mapped to the eggplant genome. Previously, the example of the prefix B_
was added to the chromosomes of the tomato genome while prefix A_
added to
the eggplant genome. To remove or keep specific clusters, we align this request
with the "task"
parameter.
# vector of control names
control_names <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
## Identify potential tomato mobile molecules
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_DESeq2,
controls = control_names,
genome.ID = "B_",
task = "keep",
statistical = FALSE)
We can plot our results as a heatmap, which represents the normalised
reads-per-million (RPM) values which have been log transformed.
Here we will plot all potential mobile molecules and those which are statistically significant:
Now, we can extrapolate information to assist the prediction of their targets and role in the biological system. mobileRNA offer three different tools to assist the functional analysis.
IMPORTANT: Alterations to the genome assemblies by the RNAmergeGenomes()
function must be replicated in the annotations. A merged annotation with the
same amendments can be created with the function RNAmergeAnnotations()
.
Each sRNA cluster contains coordinates, these can be matched with coordinates in an annotation file. A match occurs when the cluster is found within the coordinates of a feature. If there is a match, the function returns the input dataframe with additional fields of information from the annotation file.
This enables users to identify the genomic features producing the mobile sRNAs. Here we will be overlapping the data with genes and adding a buffer region of 1 kb upstream and downstream of each gene.
annotation_file <- system.file("extdata",
"prefix_reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
mobile_attributes <- RNAattributes(data = mobile_sRNA,
match ="genes",
annotation = annotation_file)
Very similar to before, we can find stricter overlaps between our candidate sRNA clusters and genomic features. However, this time we can calculate the number of sRNA clusters which are associated to each type of feature. These include promoter regions, exon, introns, untranslated regions and repeat regions. The results can either be displayed in the data frame as an absolute value or as a percentage of the total:
mobile_features <- RNAfeatures(data = mobile_sRNA,
annotation = annotation_file)
print(mobile_features)
#> Genome Dataset
#> promoters 50% 100%
#> exons 0% 0%
#> introns 50% 0%
#> 5'UTR 0% 0%
#> 3'UTR 0% 0%
#> repeats 0% 0%
#> others 0% 0%
# plot as stacked bar chart
features_plot <- plotRNAfeatures(data = sRNA_data,
annotation = annotation_file)
plot(features_plot)
To predict the targets of the mobile sRNA candidates, we need to extract
the RNA sequence. Here, we introduce the RNAsequences()
function which
extrapolates the most common RNA sequence for a cluster and determines the
complementary sequences.
The output consists of a dataframe of 6 columns where rows represent each putative sRNA cluster. The columns include:
Cluster
: name of sRNA clusterMatch
: whether the RNA sequence is consistent across replicates (either “No”, “Yes” or “Duplicate”; where “Duplicate” indicates a tie)Sequence
: RNA sequence of the most abundant sRNA within a cluster across samplesWidth
: length of nucleotide sequenceComplementary_RNA
: complementary RNA nucleotide sequenceComplementary_DNA
: complementary DNA nucleotide sequenceTo predict the potential targets in a recipient tissue, a target prediction
tool such as psRNATarget
(Xinbin Dai and Zhao 2018) can be used. For such a tool, we can
manipulate the output of RNAsequences()
to match the input style for target
prediction using the following code:
# load `dplyr` package
library(dplyr)
# select the cluster and sequence columns
sequences <- mobile_sequences %>% select(Cluster, Sequence)
# add prefix, remove row with NA
prefix <- ">"
sequences$Cluster <- paste0(prefix, sequences$Cluster)
sequences <- na.omit(sequences)
# convert to required format:
res <- do.call(rbind, lapply(seq(nrow(sequences)),
function(i) t(sequences[i, ])))
Save output:
# save output
write.table(res, file ="./mobile_sRNA_sequences.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
Other steps can be taken at this stage including alignment of the mobile sRNA
sequences to identify consensus sequences. For example, this can be undertaken
with the R package bioseq
:
If users wish to undertake the alignment step independently, the manual steps are shown below, of which, will need modifying to suit your files and directory names.
Here, ShortStack
(Axtell 2013) aligns the reads (includes multimapped reads)
with Bowtie (Langmead B 2009) and undertakes de novo sRNA clustering analysis.
bash scripting
Step 2 collates all the sRNA loci information from each sample into a text file.
R scripting
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
Finally, we repeat the sRNA clustering using the de novo sRNA loci list to
produce our results. The output folders are used as input for the RNAimport()
function.
bash scripting
ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile]" \
--threads 6 \
--mmap u \
--nohp \
--pad 200 \
--mincov 0.5 \
--outdir "[output_dir_2]"
Here, Bowtie
(Langmead B 2009) aligns the reads (excluding multimappers) and
ShortStack
(Axtell 2013) undertakes de novo sRNA clustering analysis.
bash scripting
#index reference genome
bowtie-build --threads 6 [genomefile.fa.gz] [genomefile]
# alignment
bowtie -p 6 -v 0 -a -m 1 --sam -x [genomefile] [fastqfile.fq] | samtools view -bS | samtools sort -o [outputfile.bam]
# cluster analysis with ShortStack - use the alignment files as input
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"
Step 2 collates all the sRNA loci information from each sample into a text file.
R scripting
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
Finally, we repeat the sRNA clustering using the de novo sRNA loci list to
produce our results. The output folders are used as input for the RNAimport()
function.
bash scripting
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile.txt]" \
--threads 6 \
--nohp \
--mincov 0.5 \
--pad 200 \
--outdir "[output_dir_2]"
Below states the manual steps for mRNAseq alignment for mobile mRNA identification. Keep in mind these commands need modifying to suit your files and directory names.
These steps utilise HISAT2 (Kim 2015) and HTSeq (Anders, Pyl, and Huber 2014) OS software.
If required, index the genome assembly: bash scripting
bash scripting
# alignment with hisat2
hist2 -p 6 -x [genomefile] -U [fastqfile_1.fq] -S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
bash scripting
# alignment with hisat2
hist2 -p 6 -x [genomefile] \
-1 [fastqfile_1.fq] \
-2 [fastqfile_2.fq] \
-S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 mobileRNA_1.3.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-9 pbapply_1.7-2
#> [3] testthat_3.2.1.1 rlang_1.1.4
#> [5] magrittr_2.0.3 matrixStats_1.4.1
#> [7] compiler_4.5.0 png_0.1-8
#> [9] vctrs_0.6.5 stringr_1.5.1
#> [11] pkgconfig_2.0.3 bioseq_0.1.4
#> [13] crayon_1.5.3 fastmap_1.2.0
#> [15] magick_2.8.5 XVector_0.47.0
#> [17] labeling_0.4.3 utf8_1.2.4
#> [19] Rsamtools_2.23.0 rmarkdown_2.28
#> [21] sessioninfo_1.2.2 UCSC.utils_1.3.0
#> [23] tinytex_0.53 purrr_1.0.2
#> [25] xfun_0.48 zlibbioc_1.53.0
#> [27] cachem_1.1.0 GenomeInfoDb_1.43.0
#> [29] jsonlite_1.8.9 progress_1.2.3
#> [31] highr_0.11 DelayedArray_0.33.0
#> [33] BiocParallel_1.41.0 parallel_4.5.0
#> [35] prettyunits_1.2.0 R6_2.5.1
#> [37] stringi_1.8.4 bslib_0.8.0
#> [39] RColorBrewer_1.1-3 rtracklayer_1.67.0
#> [41] limma_3.63.0 reticulate_1.39.0
#> [43] parallelly_1.38.0 brio_1.1.5
#> [45] GenomicRanges_1.59.0 jquerylib_0.1.4
#> [47] Rcpp_1.0.13 bookdown_0.41
#> [49] SummarizedExperiment_1.37.0 knitr_1.48
#> [51] future.apply_1.11.3 snow_0.4-4
#> [53] audio_0.1-11 R.utils_2.12.3
#> [55] IRanges_2.41.0 Matrix_1.7-1
#> [57] tidyselect_1.2.1 abind_1.4-8
#> [59] yaml_2.3.10 codetools_0.2-20
#> [61] curl_5.2.3 listenv_0.9.1
#> [63] lattice_0.22-6 tibble_3.2.1
#> [65] withr_3.0.2 Biobase_2.67.0
#> [67] evaluate_1.0.1 future_1.34.0
#> [69] Biostrings_2.75.0 pillar_1.9.0
#> [71] BiocManager_1.30.25 MatrixGenerics_1.19.0
#> [73] stats4_4.5.0 generics_0.1.3
#> [75] RCurl_1.98-1.16 S4Vectors_0.45.0
#> [77] hms_1.1.3 ggplot2_3.5.1
#> [79] munsell_0.5.1 scales_1.3.0
#> [81] globals_0.16.3 glue_1.8.0
#> [83] RPushbullet_0.3.4 pheatmap_1.0.12
#> [85] tools_4.5.0 BiocIO_1.17.0
#> [87] data.table_1.16.2 beepr_2.0
#> [89] SimDesign_2.17.1 GenomicAlignments_1.43.0
#> [91] locfit_1.5-9.10 XML_3.99-0.17
#> [93] grid_4.5.0 tidyr_1.3.1
#> [95] edgeR_4.5.0 colorspace_2.1-1
#> [97] GenomeInfoDbData_1.2.13 restfulr_0.0.15
#> [99] cli_3.6.3 fansi_1.0.6
#> [101] S4Arrays_1.7.0 gtable_0.3.6
#> [103] R.methodsS3_1.8.2 DESeq2_1.47.0
#> [105] sass_0.4.9 digest_0.6.37
#> [107] progressr_0.15.0 BiocGenerics_0.53.0
#> [109] SparseArray_1.7.0 ggrepel_0.9.6
#> [111] farver_2.1.2 rjson_0.2.23
#> [113] htmltools_0.5.8.1 R.oo_1.26.0
#> [115] lifecycle_1.0.4 httr_1.4.7
#> [117] statmod_1.5.0