The reason RSeqAn was created was to allow for easy integration of the SeqAn biological sequence analysis C++ library into R packages. While R is an excellent language for many other applications, it is just not fast enough for reading and writing files on the scale of next generation sequencing output. This is where a well-developed and mature library like SeqAn comes in.
This vignette only goes through the first example in the A First Example section as found in the Getting Started section of the SeqAn docs. We have modified the function slightly to make it work here in R, and will go through how and why we did so. The purpose of using this example is to help the user get an idea of how to go between SeqAn and R. To take full advantage of SeqAn though the user will need to read through SeqAn’s documentation.
Besides that, the user is expected to have some experience with both C++ and Rcpp, although it not need be extensive. After all, that is what RSeqAn is for.
The simple example in pattern_search
does, as you might
expect, a pattern search of a short query sequence (pattern) in a long
subject sequence. It returns an integer score value for each position of
the database sequence (text) as the sum of matching characters between
the pattern string and the subject substring of the database
sequence.
// [[Rcpp::depends(RSeqAn)]]
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence.h>
#include <Rcpp.h>
using namespace Rcpp;
using namespace seqan;
// [[Rcpp::export]]
IntegerVector pattern_search(std::string t, std::string p) {
seqan::String<char> text = t;
seqan::String<char> pattern = p;
seqan::String<int> score;
resize(score, length(text) - length(pattern) + 1);
// Computation of the similarities
// Iteration over the text (outer loop)
for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i)
{
int localScore = 0;
// Iteration over the pattern for character comparison
for (unsigned j = 0; j < length(pattern); ++j)
{
if (text[i + j] == pattern[j])
++localScore;
}
score[i] = localScore;
}
// Returning the result
IntegerVector s(length(score));
for (unsigned i = 0; i < length(score); ++i)
s[i] = score[i];
return s;
}
The results are shown below. We see that the maximum score possible
is 8, as there are 8 characters in the pattern tutorial
,
and it achieves that maximum score when we match together
tutorial
in the text string and tutorial
in
the pattern string. As well, the first position has a score of 1,
because the i
in the pattern string tutorial
matches the 1 i
in is
; the subject substring
here is This is
.
pattern_search("This is an awesome tutorial to get to know SeqAn!", "tutorial")
#> [1] 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 8 0 1 0 0 0 0 2 0 1 0 0 1 0 3 0 1 0 1
#> [39] 0 0 0 0
As we can see, writing a C++ function that utilizes SeqAn inside R is
quite easy with Rcpp. We included <seqan/file.h>
as
well as <seqan/sequence.h>
as those are the modules
that provide the SeqAn String class.
This is one of the most fundamental classes in SeqAn.
However, the function we wrote does look slightly different from the
one in the A
First Example section. First, instead of the function
int main()
, we have instead written the function
IntegerVector pattern_search(std::string t, std::string p)
.
(Note: we already declared the namespace std, but it was left here in
the function for clarity) Next, instead of printing the score to stdout,
we are returning it as an IntegerVector
.
The reason we did this is that in order for any function using SeqAn
to be useful in R, we probably want it to return something and to take
in input. This means that the input and output object types need to be
translatable between R and C++. SeqAn uses its own template
functions and template classes, and the String
class is one of the most fundamental classes in SeqAn. This makes sense
since SeqAn is all about analyzing sequences. However, the String class
has no direct translation to R. If you try to input
String<char> text
or return
String<int> score
you will end up with loads of
errors from the compiler. So, how do we deal with this?
One way to do this is by writing conversion functions such that R and
C++ both understand what the data type you are using (such as String)
means. Rcpp provides a nice way to do this through
Rcpp::as<T>(obj)
to convert from R to C++ and
Rcpp::wrap(obj)
to convert from C++ to R. More of this is
covered in the Rcpp vignette Extending
Rcpp. Once these functions are written, this is nice for the user as
they can just go ahead and Rcpp::wrap
and
Rcpp::as<T>
as they need. This has not been
implemented in RSeqAn yet though, although it is a future goal. Thus for
now the user will have to pay attention to how to convert between
classes in SeqAn and objects in R for each function that is written.
Rcpp has its own data
types for going between R and C++, and so that is the
IntegerVector
we declare here. Since score
is
essentially a vector of class String
with type
int
, instead of iterating through score
and
printing to stdout, we create an IntegerVector s
with the
same length as score
and iterate through score
copying its values to s
in order to be able to return the
values in score
. Similarly, we make use of the fact that
Rcpp already autoconverts character strings in R to character strings in
C++ and that character strings in C++ can be converted to
String<char>
in SeqAn to write
pattern_search
such that we can run it from R.