How to use iSEE with big data
iSEE 2.19.0
Compiled date: 2024-10-29
Last edited: 2018-03-08
License: MIT + file LICENSE
Some tweaks can be performed to enable iSEE to run efficiently on large datasets. This includes datasets with many features (methylation, SNPs) or many columns (cytometry, single-cell RNA-seq). To demonstrate some of this functionality, we will use a dataset from the TENxPBMCData package:
library(TENxPBMCData)
sce.pbmc <- TENxPBMCData("pbmc68k")
sce.pbmc$Library <- factor(sce.pbmc$Library)
sce.pbmc
#> class: SingleCellExperiment
#> dim: 32738 68579
#> metadata(0):
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#> ENSG00000215611
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Many SummarizedExperiment objects store assay matrices as in-memory matrix-like objects, be they ordinary matrices or alternative representations such as sparse matrices from the Matrix package.
For example, if we looked at the Allen data, we would see that the counts are stored as an ordinary matrix.
library(scRNAseq)
sce.allen <- ReprocessedAllenData("tophat_counts")
class(assay(sce.allen, "tophat_counts"))
#> [1] "matrix" "array"
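As a rough illustration of why the in-memory representation matters, the sketch below compares an ordinary matrix with a sparse matrix from the Matrix package. The dimensions and the fraction of non-zero entries are made up for the example and are not taken from any dataset in this vignette.

```r
library(Matrix)

set.seed(42)

# Toy count matrix where only a small fraction of entries are non-zero.
dense <- matrix(0, nrow = 2000, ncol = 500)
nonzero <- sample(length(dense), 10000)
dense[nonzero] <- rpois(10000, lambda = 5) + 1

# Same values, stored in a compressed sparse format.
sparse <- Matrix(dense, sparse = TRUE)
class(sparse)

# Compare the memory footprint of the two representations.
print(object.size(dense), units = "MB")
print(object.size(sparse), units = "MB")
```

Both objects still live entirely in memory, so this only helps when the data are sparse enough.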
In situations involving large datasets and limited computational resources, storing the entire assay in memory may not be feasible.
Rather, we can represent the data as a file-backed matrix where contents are stored on disk and retrieved on demand.
Within the Bioconductor ecosystem, the easiest way of doing this is to create an HDF5Matrix, which uses an HDF5 file to store all the assay data.
We see that this has already been done for use in the 68K PBMC dataset:
counts(sce.pbmc, withDimnames=FALSE)
#> <32738 x 68579> sparse HDF5Matrix object of type "integer":
#> [,1] [,2] [,3] [,4] ... [,68576] [,68577] [,68578]
#> [1,] 0 0 0 0 . 0 0 0
#> [2,] 0 0 0 0 . 0 0 0
#> [3,] 0 0 0 0 . 0 0 0
#> [4,] 0 0 0 0 . 0 0 0
#> [5,] 0 0 0 0 . 0 0 0
#> ... . . . . . . . .
#> [32734,] 0 0 0 0 . 0 0 0
#> [32735,] 0 0 0 0 . 0 0 0
#> [32736,] 0 0 0 0 . 0 0 0
#> [32737,] 0 0 0 0 . 0 0 0
#> [32738,] 0 0 0 0 . 0 0 0
#> [,68579]
#> [1,] 0
#> [2,] 0
#> [3,] 0
#> [4,] 0
#> [5,] 0
#> ... .
#> [32734,] 0
#> [32735,] 0
#> [32736,] 0
#> [32737,] 0
#> [32738,] 0
Despite the dimensions of this matrix, the HDF5Matrix object occupies very little space in memory.
object.size(counts(sce.pbmc, withDimnames=FALSE))
#> 2496 bytes
However, parts of the data can still be read in on demand.
For all intents and purposes, the HDF5Matrix appears to be an ordinary matrix to downstream applications and can be used as such.
first.gene <- counts(sce.pbmc)[1,]
head(first.gene)
#> [1] 0 0 0 0 0 0
This means that we can use the 68K PBMC SingleCellExperiment object in iSEE() without any extra work.
The app below shows the distribution of counts for everyone’s favorite gene MALAT1 across libraries.
Here, iSEE() is simply retrieving data on demand from the HDF5Matrix without ever loading the entire assay matrix into memory.
This enables it to run efficiently on arbitrarily large datasets with limited resources.
library(iSEE)
app <- iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1")
    )
)
Generally speaking, these HDF5 files are written once by a process with sufficient computational resources (i.e., memory and time).
We typically create HDF5Matrix objects using the writeHDF5Array() function from the HDF5Array package.
After the file is created, the objects can be read many times in more resource-constrained environments.
sce.h5 <- sce.allen
library(HDF5Array)
assay(sce.h5, "tophat_counts", withDimnames=FALSE) <-
    writeHDF5Array(assay(sce.h5, "tophat_counts"), filepath="assay.h5", name="counts")
class(assay(sce.h5, "tophat_counts", withDimnames=FALSE))
#> [1] "HDF5Matrix"
#> attr(,"package")
#> [1] "HDF5Array"
file.exists("assay.h5")
#> [1] TRUE
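To make the "write once, read many times" pattern more concrete, here is a minimal sketch that writes a small matrix to an HDF5 file and then points a new HDF5Matrix at the existing file using the HDF5Array() constructor. The toy matrix and the temporary file path are assumptions for the sake of a self-contained example; in practice the file would be the one created for the real assay.

```r
library(HDF5Array)

# Stand-in for a large assay; a temporary file keeps the sketch self-contained.
h5.path <- tempfile(fileext = ".h5")
small <- matrix(rpois(200, lambda = 3), nrow = 20)
writeHDF5Array(small, filepath = h5.path, name = "counts")

# Later (possibly in another R session), refer to the existing file.
# No assay values are loaded into memory at this point.
restored <- HDF5Array(h5.path, name = "counts")
class(restored)
dim(restored)

# Values are only read from disk when a subset is requested.
restored[1:3, 1:5]
```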
It is worth noting that iSEE() does not know or care that the data is stored in an HDF5 file.
The app is fully compatible with any matrix-like representation of the assay data that supports dim() and [, subsetting.
As such, iSEE() can be directly used with other memory-efficient objects like the DeferredMatrix and LowRankMatrix from the BiocSingular package, or perhaps the ResidualMatrix from the batchelor package.
It is also possible to downsample points to reduce the time required to generate the plot. This involves subsetting the dataset so that only the most recently plotted point for an overlapping set of points is shown. In this manner, we avoid wasting time in plotting many points that would not be visible anyway. To demonstrate, we will re-use the 68K PBMC example and perform downsampling on the feature assay plot; we can see that its aesthetics are largely similar to the non-downsampled counterpart above.
library(iSEE)
app <- iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1",
            VisualChoices="Point", Downsample=TRUE,
            VisualBoxOpen=TRUE
        )
    )
)
Downsampling is possible in all iSEE() plotting panels that represent features or samples as points.
We can turn on downsampling for all such panels using the relevant field in panelDefaults(), which spares us the hassle of setting Downsample= individually in each panel constructor.
panelDefaults(Downsample=TRUE)
The downsampling only affects the visualization and the speed of the plot rendering. Any interactions with other panels occur as if all of the points were still there. For example, if one makes a brush, all of the points therein will be selected regardless of whether they were downsampled.
The downsampling resolution determines the degree to which points are considered to be overlapping. Decreasing the resolution will downsample more aggressively, improving plotting speed but potentially affecting the fidelity of the visualization. This may compromise the aesthetics of the plot when the size of the points is small, in which case an increase in resolution may be required at the cost of speed.
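The idea behind grid-based downsampling and the effect of the resolution can be sketched in base R. This is an illustration of the concept only, not the code that iSEE() runs internally; the function name downsample_by_grid() and the simulated coordinates are hypothetical.

```r
# Illustrative grid-based downsampling; not iSEE's internal implementation.
downsample_by_grid <- function(x, y, resolution = 200) {
    # Assign each coordinate to one of 'resolution' bins along its axis.
    bin <- function(v) {
        rng <- range(v)
        width <- diff(rng)
        if (width == 0) {
            return(rep(1L, length(v)))
        }
        pmin(as.integer((v - rng[1]) / width * resolution) + 1L, resolution)
    }
    cell <- paste(bin(x), bin(y))
    # Keep only the last point in each grid cell, i.e., the one drawn on top.
    !duplicated(cell, fromLast = TRUE)
}

set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

# A lower resolution uses larger cells and therefore discards more points.
keep.coarse <- downsample_by_grid(x, y, resolution = 50)
keep.fine <- downsample_by_grid(x, y, resolution = 200)
c(total = length(x), coarse = sum(keep.coarse), fine = sum(keep.fine))
```

The logical vectors only decide which points are drawn; the full set of points would still be used for any selection.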
Obviously, downsampling will not preserve overlays for partially transparent points, but any reliance on partial transparency is probably not a good idea in the first place when there are many points.
One can generally improve the speed of the iSEE() interface by only initializing the app with the desired panels.
For example, it makes little sense to spend time rendering a RowDataPlot when only the ReducedDimensionPlot is of interest.
Specification of the initial state is straightforward with the initial= argument, as described in a previous vignette.
On occasion, there may be alternative panels with more efficient visualizations for the same data.
The prime example is the ReducedDimensionHexPlot class from the iSEEu package; this will create a hexplot rather than a scatter plot, thus avoiding the need to render each point in the latter.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] TENxPBMCData_1.23.0 HDF5Array_1.35.0
#> [3] rhdf5_2.51.0 DelayedArray_0.33.0
#> [5] SparseArray_1.7.0 S4Arrays_1.7.0
#> [7] abind_1.4-8 Matrix_1.7-1
#> [9] scater_1.35.0 ggplot2_3.5.1
#> [11] scuttle_1.17.0 scRNAseq_2.19.1
#> [13] iSEE_2.19.0 SingleCellExperiment_1.29.0
#> [15] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [17] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [19] IRanges_2.41.0 S4Vectors_0.45.0
#> [21] BiocGenerics_0.53.0 MatrixGenerics_1.19.0
#> [23] matrixStats_1.4.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.5.0 later_1.3.2 BiocIO_1.17.0
#> [4] bitops_1.0-9 filelock_1.0.3 tibble_3.2.1
#> [7] XML_3.99-0.17 lifecycle_1.0.4 httr2_1.0.5
#> [10] doParallel_1.0.17 lattice_0.22-6 ensembldb_2.31.0
#> [13] alabaster.base_1.7.0 magrittr_2.0.3 sass_0.4.9
#> [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10
#> [19] httpuv_1.6.15 DBI_1.2.3 RColorBrewer_1.1-3
#> [22] zlibbioc_1.53.0 Rtsne_0.17 purrr_1.0.2
#> [25] AnnotationFilter_1.31.0 RCurl_1.98-1.16 rappdirs_0.3.3
#> [28] circlize_0.4.16 GenomeInfoDbData_1.2.13 ggrepel_0.9.6
#> [31] irlba_2.3.5.1 alabaster.sce_1.7.0 codetools_0.2-20
#> [34] DT_0.33 tidyselect_1.2.1 shape_1.4.6.1
#> [37] UCSC.utils_1.3.0 ScaledMatrix_1.15.0 viridis_0.6.5
#> [40] shinyWidgets_0.8.7 BiocFileCache_2.15.0 GenomicAlignments_1.43.0
#> [43] jsonlite_1.8.9 GetoptLong_1.0.5 BiocNeighbors_2.1.0
#> [46] iterators_1.0.14 foreach_1.5.2 tools_4.5.0
#> [49] Rcpp_1.0.13 glue_1.8.0 gridExtra_2.3
#> [52] xfun_0.48 mgcv_1.9-1 dplyr_1.1.4
#> [55] gypsum_1.3.0 shinydashboard_0.7.2 withr_3.0.2
#> [58] BiocManager_1.30.25 fastmap_1.2.0 rhdf5filters_1.19.0
#> [61] fansi_1.0.6 shinyjs_2.1.0 digest_0.6.37
#> [64] rsvd_1.0.5 R6_2.5.1 mime_0.12
#> [67] colorspace_2.1-1 listviewer_4.0.0 RSQLite_2.3.7
#> [70] utf8_1.2.4 generics_0.1.3 rtracklayer_1.67.0
#> [73] httr_1.4.7 htmlwidgets_1.6.4 pkgconfig_2.0.3
#> [76] gtable_0.3.6 blob_1.2.4 ComplexHeatmap_2.23.0
#> [79] XVector_0.47.0 htmltools_0.5.8.1 bookdown_0.41
#> [82] ProtGenerics_1.39.0 rintrojs_0.3.4 clue_0.3-65
#> [85] scales_1.3.0 alabaster.matrix_1.7.0 png_0.1-8
#> [88] knitr_1.48 rjson_0.2.23 nlme_3.1-166
#> [91] curl_5.2.3 shinyAce_0.4.3 cachem_1.1.0
#> [94] GlobalOptions_0.1.2 BiocVersion_3.21.1 parallel_4.5.0
#> [97] miniUI_0.1.1.1 vipor_0.4.7 AnnotationDbi_1.69.0
#> [100] restfulr_0.0.15 pillar_1.9.0 grid_4.5.0
#> [103] alabaster.schemas_1.7.0 vctrs_0.6.5 promises_1.3.0
#> [106] BiocSingular_1.23.0 dbplyr_2.5.0 beachmat_2.23.0
#> [109] xtable_1.8-4 cluster_2.1.6 beeswarm_0.4.0
#> [112] evaluate_1.0.1 GenomicFeatures_1.59.0 cli_3.6.3
#> [115] compiler_4.5.0 Rsamtools_2.23.0 rlang_1.1.4
#> [118] crayon_1.5.3 ggbeeswarm_0.7.2 viridisLite_0.4.2
#> [121] alabaster.se_1.7.0 BiocParallel_1.41.0 munsell_0.5.1
#> [124] Biostrings_2.75.0 lazyeval_0.2.2 colourpicker_1.3.0
#> [127] ExperimentHub_2.15.0 bit64_4.5.2 Rhdf5lib_1.29.0
#> [130] KEGGREST_1.47.0 shiny_1.9.1 highr_0.11
#> [133] alabaster.ranges_1.7.0 AnnotationHub_3.15.0 fontawesome_0.5.2
#> [136] igraph_2.1.1 memoise_2.0.1 bslib_0.8.0
#> [139] bit_4.5.0
# devtools::session_info()
5 Comments on deployment
It is straightforward to host iSEE applications on hosting platforms like Shiny Server or RStudio Connect. All one needs to do is to create an app.R file that calls iSEE() with the desired parameters, and then follow the instructions for the target platform. For a better user experience, we suggest setting a minimum number of processes to avoid the initial delay from R start-up.
It is also possible to deploy and host Shiny apps on shinyapps.io, a platform as a service (PaaS) provided by RStudio. In many cases, users will need to configure the settings of their deployed apps, in particular selecting larger instances to provide sufficient memory for the app. The maximum of 1GB available to free accounts may not be sufficient to deploy large datasets, in which case you may consider using out-of-memory matrices, filtering your dataset (e.g., removing lowly detected features), or going for a paid account. Detailed instructions to get started are available at https://shiny.rstudio.com/articles/shinyapps.html. For example, see the isee-shiny-contest app, winner of the 1st Shiny Contest.
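An app.R file along the lines described above might look like the following. This is only a sketch: the file name sce.rds is hypothetical and stands for a SummarizedExperiment object saved beforehand with saveRDS(), and the choice of panels is purely illustrative.

```r
# app.R: a minimal sketch, not an official template.
library(iSEE)

# Hypothetical path: a SummarizedExperiment saved beforehand with saveRDS().
sce <- readRDS("sce.rds")

# Turn on downsampling for all point-based panels to speed up rendering.
panelDefaults(Downsample = TRUE)

# The last expression is the app object that the hosting platform serves.
iSEE(sce, initial = list(
    RowDataTable(),
    FeatureAssayPlot()
))
```

Restricting initial= to a few panels keeps the start-up cost low, which matters most on instances with limited memory.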