BayesSpace supports three ways of loading a
SingleCellExperiment
for analysis.
Visium datasets processed with Space
Ranger can be loaded directly via the readVisium()
function. This function takes only the path to the Space Ranger output
directory (containing the spatial/
and
filtered_feature_bc_matrix/
subdirectories) and returns a
SingleCellExperiment
.
Second, all datasets analyzed for the BayesSpace manuscript are
readily accessible via the getRDS()
function. This function
takes two arguments - the name of the dataset, and the name of the
sample in the dataset.
Finally, SingleCellExperiment
objects can be constructed
manually from a counts matrix and tables of row and column data.
BayesSpace only requires that spot array coordinates be provided as
columns named array_row
and array_col
in
colData
. (Note that enhancement of Visium datasets
additionally requires the pixel coordinates of each spot in the tissue
image, but in this case the dataset should be loaded with
readVisium()
, which loads these data automatically.)
library(Matrix)
rowData <- read.csv("path/to/rowData.csv", stringsAsFactors=FALSE)
colData <- read.csv("path/to/colData.csv", stringsAsFactors=FALSE, row.names=1)
counts <- read.csv("path/to/counts.csv.gz",
row.names=1, check.names=F, stringsAsFactors=FALSE))
sce <- SingleCellExperiment(assays=list(counts=as(counts, "dgCMatrix")),
rowData=rowData,
colData=colData)
We’ll continue with the melanoma sample from the 2018 Spatial Transcriptomics paper for the remaining examples in this vignette.
BayesSpace requires minimal data pre-processing, but we provide a helper function to automate it.
spatialPreprocess()
log-normalizes the count matrix and
performs PCA on the top n.HVGs
highly variable genes,
keeping the top n.PCs
principal components. Additionally,
the spatial sequencing platform is added as metadata in the
SingleCellExperiment
for downstream analyses. If you do not
wish to rerun PCA, running spatialPreprocess()
with the
flag skip.PCA=TRUE
will only add the metadata BayesSpace
requires.
Here, we omit log-normalization as all datasets available through
getRDS()
already include log-normalized counts.
We can use the qTune()
and qPlot()
functions to help choose q
, the number of clusters to use
in our analysis.
qTune()
runs the BayesSpace clustering algorithm for
multiple specified values of q
(by default, 3 through 7)
and computes their average pseudo-log-likelihood. It accepts any
arguments to spatialCluster()
.qPlot()
plots the pseudo-log-likelihood as a function
of q
; we suggest choosing a q
around the elbow
of this plot.The spatialCluster()
function clusters the spots, and
adds the predicted cluster labels to the
SingleCellExperiment
. Typically, as we did for the analyses
in the paper, we suggest running with at least 10,000 iterations
(nrep=10000
), but we use 1,000 iteration in this
demonstration for the sake of runtime. (Note that a random seed must be
set in order for the results to be reproducible.)
set.seed(149)
melanoma <- spatialCluster(melanoma, q=4, platform="ST", d=7,
init.method="mclust", model="t", gamma=2,
nrep=1000, burn.in=100,
save.chain=TRUE)
Both the mclust initialization (cluster.init
) and the
BayesSpace cluster assignments (spatial.cluster
) are now
available in the SingleCellExperiment’s colData
.
head(colData(melanoma))
#> DataFrame with 6 rows and 7 columns
#> array_row array_col sizeFactor spot.idx spot.neighbors cluster.init
#> <integer> <integer> <numeric> <integer> <character> <numeric>
#> 7x15 7 15 0.795588 1 2,7 1
#> 7x16 7 16 0.307304 2 1,3,8 1
#> 7x17 7 17 0.331247 3 2,4,9 2
#> 7x18 7 18 0.420747 4 3,10 3
#> 8x13 8 13 0.255453 5 6,12 1
#> 8x14 8 14 1.473439 6 5,7,13 1
#> spatial.cluster
#> <numeric>
#> 7x15 1
#> 7x16 1
#> 7x17 2
#> 7x18 2
#> 8x13 1
#> 8x14 1
We can plot the cluster assignments over the spatial locations of the
spots with clusterPlot()
.
As clusterPlot()
returns a ggplot
object,
it can be customized by composing with familiar ggplot2
functions. Additionally, the argument palette
sets the
colors used for each cluster, and clusterPlot()
takes
additional arguments to geom_polygon()
such as
size
or color
to control the aesthetics of the
spot borders.
The spatialEnhance()
function will enhance the
resolution of the principal components, and add these PCs as well as
predicted cluster labels at subspot resolution to a new
SingleCellExperiment
. As with our demonstration of
spatialCluster()
above, we are using fewer iterations for
the purpose of this example (nrep=1000
) than we recommend
in practice (nrep=100000
or greater). Note that the
jitter_scale
parameter should be tuned so that proposals
for updating subspot-level expression are accepted around 30% of the
time. This can be evaluated using
mcmcChain(melanoma.enhanced, "Ychange")
, where the chain
should stabilize to 0.25-0.40. Typically 1000-2500 iterations are
sufficient to evaluate if jitter_scale
should be increased
if acceptance is too high or decreased if acceptance is too low. After
tuning, proceed to a full run of spatialEnhance
with more
iterations.
melanoma.enhanced <- spatialEnhance(melanoma, q=4, platform="ST", d=7,
model="t", gamma=2,
jitter.prior=0.3, jitter.scale=3.5,
nrep=1000, burn.in=100,
save.chain=TRUE, cores = 1)
The enhanced SingleCellExperiment
includes an index to
the parent spot in the original sce
(spot.idx
), along with an index to the subspot. It adds the
offsets to the original spot coordinates, and provides the enhanced
cluster label (spatial.cluster
).
head(colData(melanoma.enhanced))
#> DataFrame with 6 rows and 9 columns
#> spot.idx spot.neighbors subspot.idx subspot.neighbors spot.row
#> <numeric> <character> <integer> <character> <integer>
#> subspot_1.1 1 2,7 1 294,880 7
#> subspot_2.1 2 1,3,8 1 295,587,881 7
#> subspot_3.1 3 2,4,9 1 296,588,882 7
#> subspot_4.1 4 3,10 1 297,589,883 7
#> subspot_5.1 5 6,12 1 298,884 8
#> subspot_6.1 6 5,7,13 1 299,591,885 8
#> spot.col array_row array_col spatial.cluster
#> <integer> <numeric> <numeric> <numeric>
#> subspot_1.1 15 6.66667 14.6667 1
#> subspot_2.1 16 6.66667 15.6667 1
#> subspot_3.1 17 6.66667 16.6667 1
#> subspot_4.1 18 6.66667 17.6667 3
#> subspot_5.1 13 7.66667 12.6667 2
#> subspot_6.1 14 7.66667 13.6667 1
We can plot the enhanced cluster assignments as above.
BayesSpace operates on the principal components of the gene
expression matrix, and spatialEnhance()
therefore computes
enhanced resolution PC vectors. Enhanced gene expression is not computed
directly, and is instead imputed using a regression algorithm. For each
gene, a model using the PC vectors of each spot is trained to predict
the spot-level gene expression, and the fitted model is used to predict
subspot expression from the subspot PCs.
Gene expression enhancement is implemented in the
enhanceFeatures()
function. BayesSpace predicts expression
with xgboost
by default, but linear and Dirichlet regression are also available via
the model
argument. When using xgboost
, we
suggest automatically tuning the nrounds
parameter by
setting it to 0, although this comes at the cost of increased runtime
(~4x slower than a pre-specified nrounds
in practice).
enhanceFeatures()
can be used to impute subspot-level
expression for all genes, or for a subset of genes of interest. Here,
we’ll demonstrate by enhancing the expression of four marker genes: PMEL
(melanoma), CD2 (T-cells), CD19 (B-cells), and COL1A1 (fibroblasts).
markers <- c("PMEL", "CD2", "CD19", "COL1A1")
melanoma.enhanced <- enhanceFeatures(melanoma.enhanced, melanoma,
feature_names=markers,
nrounds=0)
By default, log-normalized expression (logcounts(sce)
)
is imputed, although other assays or arbitrary feature matrices can be
specified.
logcounts(melanoma.enhanced)[markers, 1:5]
#> subspot_1.1 subspot_2.1 subspot_3.1 subspot_4.1 subspot_5.1
#> PMEL 2.4383025 1.5922332 2.23760223 3.0099206 2.0412400
#> CD2 0.3343147 0.5033790 0.28379297 0.2837930 0.2837930
#> CD19 0.6677281 0.4452745 0.38306668 0.2275964 0.4397965
#> COL1A1 0.7656322 0.9640545 0.07482389 0.8325472 0.7904220
Diagnostic measures from each predictive model, such as
rmse
when using xgboost
, are added to the
rowData
of the enhanced dataset.
rowData(melanoma.enhanced)[markers, ]
#> DataFrame with 4 rows and 4 columns
#> gene_id gene_name is.HVG enhanceFeatures.rmse
#> <character> <character> <logical> <numeric>
#> PMEL ENSG00000185664 PMEL TRUE 0.805485
#> CD2 ENSG00000116824 CD2 TRUE 0.635091
#> CD19 ENSG00000177455 CD19 TRUE 0.628285
#> COL1A1 ENSG00000108821 COL1A1 TRUE 0.555659
Spatial gene expression is visualized with
featurePlot()
.
Here, we compare the spatial expression of the imputed marker genes.
enhanced.plots <- purrr::map(markers, function(x) featurePlot(melanoma.enhanced, x))
patchwork::wrap_plots(enhanced.plots, ncol=2)
And we can compare to the spot-level expression.
If save.chain
is set to TRUE
in either
spatialCluster()
or spatialEnhance()
, the
chain associated with the respective MCMC run is preserved to disk as an
HDF5 file. The path to this file is stored in the SingleCellExperiment’s
metadata at metadata(sce)$h5.chain
, and can be read
directly using mcmcChain()
.
The chain is provided as a coda::mcmc
object, which can
be analyzed with TidyBayes or as a matrix.
The object has one row per iteration, with the values of the parameters
concatenated across the row. Columns are named with the parameter name
and index (if any).
chain <- mcmcChain(melanoma)
chain[1:5, 1:5]
#> lambda[1,1] lambda[1,2] lambda[1,3] lambda[1,4] lambda[1,5]
#> [1,] 0.0100000 0.00000000 0.00000000 0.000000e+00 0.00000000
#> [2,] 0.1836711 0.01019059 0.08664012 3.994578e-02 -0.02443957
#> [3,] 0.1790958 -0.04576098 0.07102969 8.286855e-05 -0.10919746
#> [4,] 0.1730928 -0.02089801 0.08429716 2.803978e-02 -0.10322589
#> [5,] 0.1923552 -0.02112783 0.07563260 2.088451e-02 -0.04688843
To remove the HDF5 file from disk and remove its path from the
metadata, use removeChain()
.