This vignette will review the steps needed to implement the GloScope
methodology. GloScope
is a framework for creating profiles of scRNA-Seq samples in order to globally compare and analyze them across patients or tissue samples. First this methodology estimates a global gene expression distribution for each sample, and then it calculates how divergent pairs of samples are from each other. The output from the package’s main function, gloscope()
, is a \(n\times n\) divergence matrix containing the pairwise statistical divergences between all samples. This divergence matrix can be the input to other downstream statistical and machine learning tools. This package has been submitted to Bioconductor because we believe it may be of particular interest to the bioinformatic community and because it depends on other packages from Bioconductor.
GloScope
estimates the gene expression density from a low dimensional embedding of the UMI count measurements, such as PCA or scVI embeddings. Users must provide GloScope
with a data.frame
containing each cell’s embedding coordinates, along with the metadata which identifies to which sample each cell belongs to.
The package provides a helper function plotMDS
which allows the user to visualize the divergence matrix with multidimensional scaling (MDS), but the divergence matrix can also be visualized with other algorithms as well.
A standard workflow for GloScope
consists of:
Obtain the dimension reduction embedding of the cells and specify how many dimensions to keep. This is computed outside of the GloScope
package.
Choose a density estimation method (Gaussian mixture or k-nearest neighbours) to estimate each sample’s latent distribution.
Calculate the symmetric KL divergence or Jensen-Shannon divergence between all pairs of samples.
Visualize the distance matrix with the first two dimensions of MDS using the plotMDS
function.
You can install the latest stable release of GloScope
from Bioconductor. Make sure that you have the BiocManager
package installed to proceed.
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("GloScope")
In this section, we use a toy example to illustrate the input, output and visualization steps of the GloScope
pipeline.
The data is a subset of that presented in Stephenson et al. (2021), and was obtained from this hyperlinked URL. The dataset has a total of 647,366 peripheral blood mononuclear cells (PBMCs) from 130 patients, whose phenotypes are COVID-infected, healthy control donor, patients with other non-COVID respiratory disease, and volunteers administered with intravenous lipopolysaccharide (IV-LPS). To enable faster computation in this tutorial, we subset this dataset to 20 random COVID and healthy control donors and further subsample each patient’s count matrix to 500 random cells. This subsampled data is provided in the SingleCellExperiment object example_SCE
. We emphasize that this subsampling procedure is only for demonstration purposes and is not a recommended step in normal analyses.
We first load the GloScope
package and the aforementioned example dataset.
library(GloScope)
data("example_SCE")
The example SingleCellExperiment object contains the first \(50\) principal components of the subsampled cells, as well as the sample and phenotype labels associated with each cell.
head(SingleCellExperiment::reducedDim(example_SCE,"PCA")[,1:10])
#> PC_1 PC_2 PC_3 PC_4
#> BGCV03_GCGCCAATCTGGCGTG-1 -4.677964 0.26785278 -2.68663740 1.5347402
#> BGCV03_CGTCACTTCCCAAGAT-1 -2.877392 0.04509136 12.38307762 4.5973921
#> BGCV15_TGAGGGACATCGACGC-1 -3.625502 -0.05260641 1.15989053 -5.6360350
#> BGCV03_GATCAGTAGTTACGGG-1 13.487988 -7.37058783 0.77435076 -0.8953347
#> BGCV03_TGCGGGTCACGGACAA-1 -3.740279 0.27755079 0.07352045 -4.6179056
#> BGCV03_CATGACAGTCCAGTAT-1 -2.349224 2.82407236 0.93374163 -4.7708669
#> PC_5 PC_6 PC_7 PC_8
#> BGCV03_GCGCCAATCTGGCGTG-1 0.2855917 -0.39382035 -0.09173895 -1.4540112
#> BGCV03_CGTCACTTCCCAAGAT-1 -3.2936299 -0.07147141 -1.95012903 -1.1909559
#> BGCV15_TGAGGGACATCGACGC-1 -3.1034510 2.69858479 0.35455966 -0.7677660
#> BGCV03_GATCAGTAGTTACGGG-1 -7.4018416 5.53588390 -3.99845910 2.1559813
#> BGCV03_TGCGGGTCACGGACAA-1 -2.6745775 1.97423577 -0.32567784 -0.3111276
#> BGCV03_CATGACAGTCCAGTAT-1 -0.5981533 0.95109445 -1.91192818 1.6709988
#> PC_9 PC_10
#> BGCV03_GCGCCAATCTGGCGTG-1 -0.6873072 3.0339348
#> BGCV03_CGTCACTTCCCAAGAT-1 0.9220569 -1.0549433
#> BGCV15_TGAGGGACATCGACGC-1 0.2317176 -1.9861494
#> BGCV03_GATCAGTAGTTACGGG-1 -2.9655695 3.9789870
#> BGCV03_TGCGGGTCACGGACAA-1 0.4197560 -0.8079447
#> BGCV03_CATGACAGTCCAGTAT-1 -1.9040219 -2.5736504
head(SingleCellExperiment::colData(example_SCE))
#> DataFrame with 6 rows and 3 columns
#> sample_id phenotype cluster_id
#> <factor> <factor> <factor>
#> BGCV03_GCGCCAATCTGGCGTG-1 CV0176 Covid CD8
#> BGCV03_CGTCACTTCCCAAGAT-1 CV0176 Covid B_cell
#> BGCV15_TGAGGGACATCGACGC-1 CV0176 Covid CD4
#> BGCV03_GATCAGTAGTTACGGG-1 CV0176 Covid CD14
#> BGCV03_TGCGGGTCACGGACAA-1 CV0176 Covid CD8
#> BGCV03_CATGACAGTCCAGTAT-1 CV0176 Covid RBC
The following table confirms that each donor provides \(500\) cells, and that both phenotypes are represented.
table(SingleCellExperiment::colData(example_SCE)$sample_id, SingleCellExperiment::colData(example_SCE)$phenotype)
#>
#> Covid Healthy LPS Non_covid
#> AP1 0 0 0 0
#> AP2 0 0 0 0
#> AP3 0 0 0 0
#> AP4 0 0 0 0
#> AP5 0 0 0 0
#> AP6 0 0 0 0
#> AP8 0 0 0 0
#> AP9 0 0 0 0
#> AP10 0 0 0 0
#> AP11 0 0 0 0
#> AP12 0 0 0 0
#> CV0025 500 0 0 0
#> CV0037 500 0 0 0
#> CV0050 500 0 0 0
#> CV0052 0 0 0 0
#> CV0058 500 0 0 0
#> CV0059 0 0 0 0
#> CV0062 0 0 0 0
#> CV0068 0 0 0 0
#> CV0073 0 0 0 0
#> CV0074 0 0 0 0
#> CV0084 0 0 0 0
#> CV0094 0 0 0 0
#> CV0100 500 0 0 0
#> CV0104 0 0 0 0
#> CV0120 500 0 0 0
#> CV0128 500 0 0 0
#> CV0134 0 0 0 0
#> CV0137 500 0 0 0
#> CV0144 500 0 0 0
#> CV0155 500 0 0 0
#> CV0160 500 0 0 0
#> CV0164 0 0 0 0
#> CV0171 0 0 0 0
#> CV0176 500 0 0 0
#> CV0178 0 0 0 0
#> CV0180 500 0 0 0
#> CV0198 0 0 0 0
#> CV0200 0 0 0 0
#> CV0201 0 0 0 0
#> CV0231 0 0 0 0
#> CV0234 500 0 0 0
#> CV0257 500 0 0 0
#> CV0262 0 0 0 0
#> CV0279 500 0 0 0
#> CV0284 0 0 0 0
#> CV0326 0 0 0 0
#> CV0902 0 0 0 0
#> CV0904 0 0 0 0
#> CV0911 0 0 0 0
#> CV0915 0 500 0 0
#> CV0917 0 500 0 0
#> CV0926 0 0 0 0
#> CV0929 0 0 0 0
#> CV0934 0 0 0 0
#> CV0939 0 0 0 0
#> CV0940 0 500 0 0
#> CV0944 0 500 0 0
#> MH8919176 0 0 0 0
#> MH8919177 0 0 0 0
#> MH8919178 0 0 0 0
#> MH8919179 0 0 0 0
#> MH8919226 0 0 0 0
#> MH8919227 0 0 0 0
#> MH8919228 0 0 0 0
#> MH8919229 0 0 0 0
#> MH8919230 0 0 0 0
#> MH8919231 0 0 0 0
#> MH8919232 0 0 0 0
#> MH8919233 0 0 0 0
#> MH8919276 0 0 0 0
#> MH8919277 0 0 0 0
#> MH8919278 0 0 0 0
#> MH8919279 0 0 0 0
#> MH8919280 0 0 0 0
#> MH8919281 0 0 0 0
#> MH8919282 0 0 0 0
#> MH8919283 0 0 0 0
#> MH8919326 0 0 0 0
#> MH8919327 0 0 0 0
#> MH8919328 0 0 0 0
#> MH8919329 0 0 0 0
#> MH8919330 0 0 0 0
#> MH8919331 0 0 0 0
#> MH8919332 0 0 0 0
#> MH8919333 0 0 0 0
#> MH9143270 0 0 0 0
#> MH9143271 0 0 0 0
#> MH9143272 0 0 0 0
#> MH9143273 0 0 0 0
#> MH9143274 0 0 0 0
#> MH9143275 0 0 0 0
#> MH9143276 0 0 0 0
#> MH9143277 0 0 0 0
#> MH9143320 0 0 0 0
#> MH9143321 0 0 0 0
#> MH9143322 0 0 0 0
#> MH9143323 0 0 0 0
#> MH9143324 0 0 0 0
#> MH9143325 0 0 0 0
#> MH9143326 0 0 0 0
#> MH9143327 0 0 0 0
#> MH9143370 0 0 0 0
#> MH9143371 0 0 0 0
#> MH9143372 0 0 0 0
#> MH9143373 0 0 0 0
#> MH9143420 0 0 0 0
#> MH9143421 0 0 0 0
#> MH9143422 0 0 0 0
#> MH9143423 0 0 0 0
#> MH9143424 0 0 0 0
#> MH9143425 0 0 0 0
#> MH9143426 0 0 0 0
#> MH9143427 0 0 0 0
#> MH9179821 0 0 0 0
#> MH9179822 0 0 0 0
#> MH9179823 0 0 0 0
#> MH9179824 0 0 0 0
#> MH9179825 0 0 0 0
#> MH9179826 0 0 0 0
#> MH9179827 0 0 0 0
#> MH9179828 0 0 0 0
#> newcastle004v2 0 0 0 0
#> newcastle20 0 0 0 0
#> newcastle21 0 0 0 0
#> newcastle21v2 0 0 0 0
#> newcastle49 0 0 0 0
#> newcastle59 0 0 0 0
#> newcastle65 0 0 0 0
#> newcastle74 0 0 0 0
GloScope
expects that the user provides GloScope
with a data.frame
containing each cell’s low-dimensional embedding (with cells in rows), along with a vector which contains the sample from which each cell in the embedding matrix is drawn. The PCA embeddings in the example dataset are provided by the authors of Stephenson et al. (2021). In general, users of GloScope
can input any dimensionality reduction to the method. UMI counts are often provided as Seurat
or SingleCellExperiment
objects, and many dimensionality reduction strategies, including PCA and scVI (Lopez et al. (2018)) can be computed and saved within those data structures. This is recommended, as it will allow the user to keep the counts, embeddings, and the meta data (like the sample from which each cell was isolated) in the same structure and will minimize the change of inadvertently mislabelling the sample of a cell.
The following code, which is only pseudo-code and not evaluated (since we do not provide the seurat_object
), demonstrates how PCA embeddings and sample labels can be extracted from a Seurat
object for input into the gloscope
function.
embedding_df <- seurat_object@reductions$pca@cell.embeddings
sample_ids <- seurat_object@meta.data$sample_id
With the cell embeddings and the sample labels in the proper format, the gloscope
function is simple to setup and run. For simplicity, we will save these as separate objects, though for real datasets, this would not be recommended since it would unnecessarily make copies of the data and increase memory usage:
embedding_matrix <- SingleCellExperiment::reducedDim(example_SCE,"PCA")[,1:10]
sample_ids <- SingleCellExperiment::colData(example_SCE)$sample_id
Although the example data contains the first 50 principal components, we chose to use only the first 10 for calculations. Large number of latent variables will make the density estimation unstable, so we do not recommend large increases to the number of latent variables.
The base function call is gloscope(embedding_df, sample_ids)
.
# Can take a couple of minutes to run:
gmm_divergence <- gloscope(embedding_matrix, sample_ids)
The default implementation, run above, implements the GMM
option for density estimation; this is the method primarily considered in the manuscript, but can take longer to run so we haven’t evaluated it here. (You can evaluate it by changing eval=FALSE
to eval=TRUE
).
An alternative methods uses a non-parametric alternative for density estimation based on a \(k\)-nearest neighbours algorithm and can be chosen with the argument dens
. We will use this on our examples simply to make the tutorial run quickly:
knn_divergence <- gloscope(embedding_matrix, sample_ids, dens="KNN")
knn_divergence[1:5,1:5]
#> CV0176 CV0257 CV0279 CV0120 CV0917
#> CV0176 0.000000 2.388813 1.893389 2.024262 2.333317
#> CV0257 2.388813 0.000000 1.608448 3.048070 4.432494
#> CV0279 1.893389 1.608448 0.000000 2.414725 1.690126
#> CV0120 2.024262 3.048070 2.414725 0.000000 2.465142
#> CV0917 2.333317 4.432494 1.690126 2.465142 0.000000
Note, that unlike PCA, not every dimensionality reduction method retains its statistical properties when only a subset of the coordinates is retained, with scVI being an example. For methods like scVI, you should choose the number of latent variables you will want to use when calculating the latent variables and not subset them after the fact.
If the user chooses the default GMM method (dens="GMM"
), gloscope
fits sample-level densities with Gaussian mixture models (GMMs) implemented by mclust
. The mclust
package uses the Bayesian information criterion (BIC) to select a GMM from a family of models, and the user of GloScope
can specify how many centroids should be considered in that family. By default GMMs with \(1\) through \(9\) centroids are compared. With the num_components
optional vector to gloscope
the user can specify the possible number of centroids for mclust
to compute, with the one with the best BIC value being the final choice.
When GMMs are used for density fitting, a Monte-Carlo approximation is then used to compute the pairwise divergences from the estimated densities. This means that the resulting estimate is stochastic, and details about controlling the random seed appear later in this vignette. The number of Monte-Carlo draws from the density of each sample is \(10,000\) by default, and this is controlled by the optional parameter \(r\) in the gloscope
function.
# Can take a couple of minutes to run:
gmm_divergence_alt<-gloscope(embedding_matrix, sample_ids, dens = "GMM", num_components = c(2,4,6),r=20000)
A non-parametric alternative for density and divergence estimation is the \(k\)-nearest neighbours algorithm. To use this technique, the optional argument dens="KNN"
should be set. The number of neighbors is a hyperparameter, equal to \(50\) by default, and governed by the optional argument k
. It is important to note that negative divergences between similar samples are possible with this density estimation choice. The gloscope
function does not censor or round any negative values in the output matrix, leaving that decision to the user.
knn_divergence_alt <- gloscope(embedding_matrix, sample_ids,
dens = "KNN", k = 25)
The default divergence for GloScope
is the symmetric KL divergence, but the Jensen-Shannon divergence is also implemented. This can be controlled by setting the argument dist_mat="KL"
or dist_mat="JS"
, respectively. One beneficial property of the Jensen-Shannon divergence is that its square root is a proper distance metric. Note that gloscope
returns a matrix of untransformed divergences, and the user must take the square root of matrix entries themselves if this is desired.
The plotMDS
function provided by this package visualizes the output divergence matrix with multidimensional scaling. The plotMDS
function utilizes the isoMDS
function from the package MASS
, and then creates a scatter plot with samples color-coded by a user-specified covariate such as phenotype
This function requires a data.frame
with the relevant metadata at the sample level, rather than at the cell level. This can easily be obtained by applying the unique
function from base R to the cell-level metadata data.frame
.
pat_info <- as.data.frame(unique(SingleCellExperiment::colData(example_SCE)[,-c(3)]))
head(pat_info)
#> sample_id phenotype
#> BGCV03_GCGCCAATCTGGCGTG-1 CV0176 Covid
#> BGCV11_AGGCCGTCATCGATGT-1 CV0257 Covid
#> BGCV09_AGCTTGAGTACTTAGC-1 CV0279 Covid
#> BGCV05_CATGCCTGTGTCGCTG-1 CV0120 Covid
#> BGCV09_ATCATCTAGTGATCGG-1 CV0917 Healthy
#> BGCV08_CCACTACGTTAAGGGC-1 CV0915 Healthy
Here we plot the MDS representation with each sample color-coded by the phenotype
variable. Note the function call returns both a matrix of MDS embeddings and a ggplot
visualization of the first two dimensions.
mds_result <- plotMDS(dist_mat = knn_divergence, metadata = pat_info, "sample_id","phenotype", k = 2)
mds_result$plot
Another classical way to visualize a divergence matrix is with a heatmap. The following code demonstrates that the output of gloscope
is easily used in plotting functions beyond the package.
heatmap(knn_divergence)
To speed-up calculations of the pair-wise divergences, GloScope
allows for parallelizing the calculation. The argument BPPARAM
controls the parameters of the parallelization (see bplapply
).
The default is no parallelization, but the iteration across sample-pairs will still via the function bplapply
. In this case (i.e. no parallelization), the argument is simply BPPARAM = BiocParallel::SerialParam()
.
IMPORTANT: Due to the construction of the NAMESPACE file, it is essential that any setting of the BPPARAM optional argument uses the BiocParallel::
namespace prefix. For example, gloscope(...,BPPARAM = MulicoreParam()
will raise an error, and gloscope(...,BPPARAM = BiocParallel::MulicoreParam()
should be used instead.
The calculation of the KL divergence from the GMM density estimate uses Monte-Carlo approximation, and hence has to randomly sample from the estimated density. To set the seed for the pseudo-random number generator used in the simulation, the seed needs to be set within the argument to BPPARAM
and not by a call to set.seed
(see https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/Random_Numbers.html for more information).
This is how the seed must be set, even if there is no parallelization chosen (the default), because the iteration over sample pairs is sent through bplapply
function regardless, as noted above. Setting the seed outside the function via set.seed
will not have an effect on the function.
The following is an example of how to set the random seed when running the GMM
option, using the default of no parallelization:
gmm_divergence <- gloscope(embedding_matrix, sample_ids, dens = "GMM", dist_mat = "KL",
BPPARAM = BiocParallel::SerialParam(RNGseed = 2))
The same argument (RNGseed
) can be added to other BPPARAM
arguments to set the seed.
Note that the KNN
estimation procedure does not have any Monte-Carlo approximation steps, and thus does not need to have a random seed.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GloScope_1.5.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 gtable_0.3.6
#> [3] xfun_0.48 bslib_0.8.0
#> [5] ggplot2_3.5.1 Biobase_2.67.0
#> [7] lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.5.0 generics_0.1.3
#> [11] stats4_4.5.0 parallel_4.5.0
#> [13] tibble_3.2.1 fansi_1.0.6
#> [15] highr_0.11 pkgconfig_2.0.3
#> [17] Matrix_1.7-1 S4Vectors_0.45.0
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [21] farver_2.1.2 compiler_4.5.0
#> [23] FNN_1.1.4.1 tinytex_0.53
#> [25] munsell_0.5.1 codetools_0.2-20
#> [27] GenomeInfoDb_1.43.0 htmltools_0.5.8.1
#> [29] sass_0.4.9 yaml_2.3.10
#> [31] pillar_1.9.0 crayon_1.5.3
#> [33] jquerylib_0.1.4 MASS_7.3-61
#> [35] BiocParallel_1.41.0 SingleCellExperiment_1.29.0
#> [37] DelayedArray_0.33.0 cachem_1.1.0
#> [39] magick_2.8.5 mclust_6.1.1
#> [41] abind_1.4-8 tidyselect_1.2.1
#> [43] digest_0.6.37 dplyr_1.1.4
#> [45] bookdown_0.41 labeling_0.4.3
#> [47] fastmap_1.2.0 grid_4.5.0
#> [49] colorspace_2.1-1 cli_3.6.3
#> [51] SparseArray_1.7.0 magrittr_2.0.3
#> [53] S4Arrays_1.7.0 utf8_1.2.4
#> [55] withr_3.0.2 UCSC.utils_1.3.0
#> [57] scales_1.3.0 rmarkdown_2.28
#> [59] XVector_0.47.0 httr_1.4.7
#> [61] matrixStats_1.4.1 RANN_2.6.2
#> [63] mvnfast_0.2.8 evaluate_1.0.1
#> [65] knitr_1.48 GenomicRanges_1.59.0
#> [67] IRanges_2.41.0 rlang_1.1.4
#> [69] Rcpp_1.0.13 glue_1.8.0
#> [71] BiocManager_1.30.25 BiocGenerics_0.53.0
#> [73] jsonlite_1.8.9 R6_2.5.1
#> [75] MatrixGenerics_1.19.0 zlibbioc_1.53.0