This package contains an updated parallel implementation of UniBic biclustering algorithm for gene expression data [Wang2016]. The algorithm locates trend-preserving biclusters within complex and noisy data and is considered one of the most accurate among biclustering methods.
Since their first application to gene expression data [Cheng2000] biclustering algorithms have gained much popularity [Eren2012]. The reason for this is the ability of the methods to simultaneously detect valid patterns within the data that include only subset of rows and columns.
In this package we provide an efficient parallel improved implementation of UniBic biclustering algorithm. This state-of-the-art algorithm is said to outperform multiple biclustering methods on both synthetic and genetic datasets. Our major contributions are reimplementing the method into more modern C++11 language as well as multiple improvements in code (memory management, code refactoring etc.)
This package provides the following main functions: *
BCUnibic
/runibic
- parallel UniBic for
continuous data * BCUnibicD
- parallel UniBic for discrete
data
The package provides some additional functions: *
pairwiseLCS
- calculates Longest Common Subsequence (LCS)
between two vectors * calculateLCS
- calculates LCSes
between all pairs of the input dataset (Fibonacci heap or standard sort)
* backtrackLCS
- recovers LCS from the dynamic programming
matrix * cluster
- main part of UniBic algorithm
(biclusters seeding and expanding) * unisort
- returns
matrix of indexes based on the increasing order in each row *
discretize
- performs discretization using Fibonacci heap
(sorting method used originally in UniBic)
The package may be installed as follows:
This example shows the basic usage of runibic on synthetic data. We start with loading the libraries.
First we prepare a random matrix.
Then, we run UniBic biclustering algorithm on the dataset. We use a
Biclust
package wrapper.
We can inspect the results returned by the method. Let’s see how many biclusters were detected.
We could also inspect the rows of a specific bicluster, for example the third one.
Similarly we could check the indexes of columns of the third bicluster.
Visual analyses are very useful in biclustering. Let’s draw the
heatmap with the obtained bicluster using drawHeatmap
function from biclust
package.
We could also present parallel coordinates plot of the bicluster with
function parallelCoordinates
from biclust
package.
Similarly the package could be used to find trends in discrete data.
We run runiDiscretize
function.
Finally, we run the UniBic algorithm dedicated to work with discrete datasets.
This example presents how to use runibic package on a sample dataset. After loading all required libraries
we apply UniBic biclustering method to the sample yeast dataset from
biclust
package.
The method has found 92 biclusters. Now we are going to analyze the
results. We will start with drawing a heatmap of the first found
bicluster using drawHeatmap
from biclust
package.
Then we may check the result drawing the genes from the bicluster
using parallel coordinates plot with parallelCoordinates
function from biclust
package.
The third application of runibic is using the algorithm on the
dataset from an RNA-Seq experiment (SummarizedExperiment
).
We start with loading necessary packages (runibic
,
SummarizedExperiment
) and load the airway dataset.
We will take only a subset of the data to check if the method can return any patterns.
Let’s see if UniBic detects any pattern in the dataset.
The result could be visualized using parallelCoordinates
method from biclust
package.
We can also draw a heatmap with the second bicluster.
Another very useful application of the runibic package is analyzing
real dataset obtained from Gene Expression Omnibus (GEO). For this task
we will use three packages: getGEO
from
GEOquery
to download the data, GDS2eSet
from
affy
to convert the dataset to ExpressionSet
and finally exprs
to extract gene expression matrix.
We download dataset with Rat peripheral and brain regions from Gene Expression Omnibus.
Now we convert dataset to ExpressionSet and take subset of the dataset.
We perform analysis on first 100 of genes.
Finally, we draw the heatmap with the first bicluster using
drawHeatmap
function from biclust
package.
We load package with QUBIC biclustering algorithm for comparison
Now, we perform biclustering using CC, Bimax, Qubic, Plaid and Unibic:
resCC <- biclust::biclust(BicatYeast, method = BCCC())
resBi <- biclust::biclust(BicatYeast, method = BCBimax())
resQub <- biclust::biclust(BicatYeast, method = BCQU())
resPlaid <- biclust::biclust(BicatYeast, method = BCPlaid())
resUni <- biclust::biclust(BicatYeast, method = BCUnibic())
The results of clustering could be compared using
showinfo
from QUBIC
package.
The Longest Common Subsequence (LCS) between two vectors is the longest series of the data that is present in both analyzed vectors. Let’s prepare two vectors for the analysis.
You may notice that the values (1,2,4) are contained in both vectors: in vector A: (1,2,x,x,4,x) and in vector B: (x,1,x,2,x,4)
Let’s use the methods provided by the package to calculate the
longest common subsequence. We first check which values are common for
both vectors using backtrackLCS
.
Then we calculate using dynamic programming the matrix for Longest
Common Subsequence with pairwiselLCS
In our next example we will find the length of the longest common subsequence within the matrix We start with preparing input matrix.
A <- matrix(c(11, 17, 12, 10, 8, 9, 19, 15, 18, 13, 14, 7, 4, 6, 16, 2, 3, 1, 5, 20,
17, 1, 8, 15, 5, 10, 2, 12, 9, 7, 3, 14, 11, 4, 6, 16, 20, 13, 19, 18,
15, 8, 17, 12, 18, 14, 19, 11, 16, 20, 10, 13, 6, 3, 7, 9, 1, 2, 5, 4,
15, 12, 16, 9, 19, 17, 10, 18, 11, 20, 8, 13, 2, 5, 7, 14, 1, 3, 4, 6,
15, 10, 9, 6, 13, 19, 7, 18, 16, 17, 14, 4, 3, 1, 2, 20, 12, 5, 11, 8,
1, 7, 4, 3, 2, 6, 8, 13, 5, 9, 12, 11, 16, 15, 17, 10, 19, 20, 14, 18,
10, 5, 3, 9, 2, 11, 6, 13, 8, 1, 7, 4, 16, 14, 15, 12, 18, 17, 20, 19,
10, 5, 1, 12, 8, 11, 7, 13, 6, 4, 3, 2, 18, 14, 15, 9, 17, 16, 20, 19,
9, 6, 3, 10, 1, 12, 7, 13, 8, 2, 5, 4, 16, 14, 15, 11, 19, 17, 20, 18,
12, 8, 1, 3, 2, 11, 4, 14, 9, 7, 10, 5, 16, 13, 15, 6, 18, 17, 20, 19), nrow = 10, byrow = TRUE)
We sort data in each row with the function unisort
provided by the runibic
package.
Now we calculate the length of LCS between each pair of rows using
calculateLCS
either using Fibonacci Heap or standard
sort.
You may notice that depending on the method chosen the results may differ.
We can check the length of the longest common subsequence (e.g. LCS between rows 6 and 7 is equal to 10).
First we create a random matrix
Discretization from UniBic (using original sorting method using
Fibonacci Heap) could be applied using runiDiscretize
function.
In order to take full advantage of the modularity offered by runibic, we apply UniBic algorithm step by step using standard stable sort. The last example shows how to use cluster function from the package. After creating a matrix we sort values in each row and calculate LCS between all pairs of rows.
A <- matrix(c(4, 3, 1, 2, 5, 8, 6, 7, 9, 10, 11, 12), nrow = 4, byrow = TRUE)
iA <- unisort(A)
lcsResults <- calculateLCS(A, useFibHeap=FALSE)
Finally, we apply cluster
function to inspect the
results.
For the original sequential biclustering algorithm please use the following citation:
Zhenjia Wang, Guojun Li, Robert W. Robinson, Xiuzhen Huang UniBic: Sequential row-based biclustering algorithm for analysis of gene expression data Scientific Reports 6, 2016; 23466, doi: https://doi:10.1038/srep23466
If you use in your work this package with parallel version of UniBic please use the following citation:
Patryk Orzechowski, Artur Pańszczyk, Xiuzhen Huang, Jason H Moore; runibic: a Bioconductor package for parallel row-based biclustering of gene expression data Bioinformatics, , bty512, https://doi.org/10.1093/bioinformatics/bty512
BibTex entry:
@article{orzechowski2018runibic,
author = {Orzechowski, Patryk and Pańszczyk, Artur and Huang, Xiuzhen and Moore, Jason H},
title = {runibic: a Bioconductor package for parallel row-based biclustering of gene expression data},
journal = {Bioinformatics},
volume = {},
number = {},
pages = {bty512},
year = {2018},
doi = {10.1093/bioinformatics/bty512},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty512},
eprint = {/oup/backfile/content_public/journal/bioinformatics/pap/10.1093_bioinformatics_bty512/4/bty512.pdf}
}