Flow and mass cytometry are important modern immunology tools for measuring expression levels of multiple proteins on single cells. The goal is to better understand the mechanisms of responses on a single cell basis by studying differential expression of proteins. Most current data analysis tools compare expressions across many computationally discovered cell types. Our goal is to focus on just one cell type. Differential analysis of marker expressions can be difficult due to marker correlations and inter-subject heterogeneity, particularly for studies of human immunology. We address these challenges with two multiple regression strategies: A bootstrapped generalized linear model (GLM) and a generalized linear mixed model (GLMM). Here, we illustrate the CytoGLMM
R package and workflow for simulated mass cytometry data.
We construct our simulated datasets by sampling from a Poisson GLM. We confirmed—with predictive posterior checks—that Poisson GLMs with mixed effects provide a good fit to mass cytometry data (Seiler et al. 2019). We consider one underlying data generating mechanisms described by a hierarchical model for the \(i\)th cell and \(j\)th donor:
\[ \begin{aligned} \boldsymbol{X}_{ij} & \sim \text{Poisson}(\boldsymbol{\lambda}_{ij}) \\ \log(\boldsymbol{\lambda}_{ij}) & = \boldsymbol{B}_{ij} + \boldsymbol{U}_j \\ \boldsymbol{B}_{ij} & \sim \begin{cases} \text{Normal}(\boldsymbol{\delta}^{(0)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 0, \text{ cell unstimulated} \\ \text{Normal}(\boldsymbol{\delta}^{(1)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 1, \text{ cell stimulated} \end{cases} \\ \boldsymbol{U}_j & \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Sigma}_U). \end{aligned} \]
The following graphic shows a representation of the hierarchical model.
The stimulus activates proteins and induces a difference in marker expression. We define the effect size to be the difference between expected expression levels of stimulated versus unstimulated cells on the \(\log\)-scale. All markers that belong to the active set , have a non-zero effect size, whereas, all markers that are not, have a zero effect size:
\[ \begin{cases} \delta^{(1)}_p - \delta^{(0)}_p > 0 & \text{if protein } p \text{ is in activation set } p \in C \\ \delta^{(1)}_{p'} - \delta^{(0)}_{p'} = 0 & \text{if protein } p' \text{ is not in activation set } p' \notin C. \end{cases} \]
Both covariance matrices have an autoregressive structure,
\[ \begin{aligned} \Omega_{rs} & = \rho^{|r-s|} \\ \boldsymbol{\Sigma} & = \operatorname{diag}(\boldsymbol{\sigma}) \, \boldsymbol{\Omega} \, \operatorname{diag}(\boldsymbol{\sigma}), \end{aligned} \]
where \(\Omega_{rs}\) is the \(r\)th row and \(s\)th column of the correlation matrix \(\boldsymbol{\Omega}\). We regulate two separate correlation parameters: a cell-level \(\rho_B\) and a donor-level \(\rho_U\) coefficient. Non-zero \(\rho_B\) or \(\rho_U\) induce a correlation between condition and marker expression even for markers with a zero effect size.
library("CytoGLMM")
library("magrittr")
set.seed(23)
df = generate_data()
df[1:5,1:5]
## # A tibble: 5 × 5
## donor condition m01 m02 m03
## <int> <fct> <int> <int> <int>
## 1 1 treatment 2 127 116
## 2 1 treatment 0 96 54
## 3 1 treatment 1 14 52
## 4 1 treatment 6 20 34
## 5 1 treatment 18 30 15
We define the marker names that we will focus on in our analysis by extracting them from the simulated data frame.
protein_names = names(df)[3:12]
We recommend that marker expressions be corrected for batch effects (Nowicka et al. 2017; Chevrier et al. 2018; Schuyler et al. 2019; Van Gassen et al. 2020; Trussart et al. 2020) and transformed using variance stabilizing transformations to account for heteroskedasticity, for instance with an inverse hyperbolic sine transformation with the cofactor set to 150 for flow cytometry, and 5 for mass cytometry (Bendall et al. 2011). This transformation assumes a two-component model for the measurement error (Rocke and Lorenzato 1995; Huber et al. 2003) where small counts are less noisy than large counts. Intuitively, this corresponds to a noise model with additive and multiplicative noise depending on the magnitude of the marker expression; see (Holmes and Huber 2019) for details.
df %<>% dplyr::mutate_at(protein_names, function(x) asinh(x/5))
The goal of the CytoGLMM::cytoglm
function is to find protein expression patterns that are associated with the condition of interest, such as a response to a stimulus. We set up the GLM to predict the experimental condition (condition
) from protein marker expressions (protein_names
), thus our experimental conditions are response variables and marker expressions are explanatory variables.
glm_fit = CytoGLMM::cytoglm(df,
protein_names = protein_names,
condition = "condition",
group = "donor",
num_cores = 1,
num_boot = 1000)
glm_fit
##
## #######################
## ## paired analysis ####
## #######################
##
## number of bootstrap samples: 1000
##
## number of cells per group and condition:
## control treatment
## 1 1000 1000
## 2 1000 1000
## 3 1000 1000
## 4 1000 1000
## 5 1000 1000
## 6 1000 1000
## 7 1000 1000
## 8 1000 1000
##
## proteins included in the analysis:
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10
##
## condition compared: condition
## grouping variable: donor
We plot the maximum likelihood estimates with 95% confidence intervals for the fixed effects \(\boldsymbol{\beta}\). The estimates are on the \(\log\)-odds scale. We see that markers m1, m2, and m3 are strong predictors of the treatment. This means that one unit increase in the transformed marker expression makes it more likely to be a cell from the treatment group, while holding the other markers constant.
plot(glm_fit)
The summary
function returns a table about the model fit with unadjusted and Benjamini-Hochberg (BH) adjusted \(p\)-values.
summary(glm_fit)
## # A tibble: 10 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.002 0.02
## 2 m03 0.004 0.02
## 3 m01 0.016 0.0533
## 4 m06 0.334 0.835
## 5 m04 0.508 0.865
## 6 m08 0.526 0.865
## 7 m09 0.67 0.865
## 8 m05 0.692 0.865
## 9 m10 0.84 0.93
## 10 m07 0.93 0.93
We can extract the proteins below an False Discovery Rate (FDR) of 0.05 from the \(p\)-value table by filtering the table.
summary(glm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 2 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m02 0.002 0.02
## 2 m03 0.004 0.02
In this simulated dataset, the markers m2 and m3 are below an FDR of 0.05.
In the CytoGLMM::cytoglmm
function, we make additional modeling assumptions by adding a random effect term in the standard logistic regression model to account for the subject effect (group
). In paired experimental design—when the same donor provides two samples, one for each condition—CytoGLMM::cytoglmm
is statistically more powerful.
glmm_fit = CytoGLMM::cytoglmm(df,
protein_names = protein_names,
condition = "condition",
group = "donor",
num_cores = 1)
glmm_fit
## number of cells per group and condition:
## control treatment
## 1 1000 1000
## 2 1000 1000
## 3 1000 1000
## 4 1000 1000
## 5 1000 1000
## 6 1000 1000
## 7 1000 1000
## 8 1000 1000
##
## proteins included in the analysis:
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10
##
## condition compared: condition
## grouping variable: donor
We plot the method of moments estimates with 95% confidence intervals for the fixed effects \(\boldsymbol{\beta}\).
plot(glmm_fit)
The summary
function returns a table about the model fit with unadjusted and BH adjusted \(p\)-values.
summary(glmm_fit)
## # A tibble: 10 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m03 0.0000682 0.000682
## 2 m02 0.000163 0.000813
## 3 m01 0.000681 0.00227
## 4 m06 0.155 0.388
## 5 m05 0.371 0.646
## 6 m08 0.388 0.646
## 7 m04 0.512 0.731
## 8 m09 0.859 0.954
## 9 m10 0.874 0.954
## 10 m07 0.954 0.954
We can extract the proteins below an FDR of 0.05 from the \(p\)-value table by filtering the table.
summary(glmm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 3 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m03 0.0000682 0.000682
## 2 m02 0.000163 0.000813
## 3 m01 0.000681 0.00227
In this simulated dataset, the markers m1, m2 and m3 are below an FDR of 0.05. This illustrates the improved power for paired samples as CytoGLMM::cytoglmm
correctly detects all three differentially expressed proteins. We performed extensive power simulations and comparisons to other methods in Seiler et al. (2021).
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_2.0.3 CytoGLMM_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 mbest_0.6 sandwich_3.1-1
## [4] rlang_1.1.4 compiler_4.5.0 flexmix_2.3-19
## [7] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
## [10] pkgconfig_2.0.3 fastmap_1.2.0 magick_2.8.5
## [13] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.28
## [16] prodlim_2024.06.25 nloptr_2.1.1 tinytex_0.53
## [19] purrr_1.0.2 xfun_0.48 modeltools_0.2-23
## [22] cachem_1.1.0 jsonlite_1.8.9 recipes_1.1.0
## [25] highr_0.11 uuid_1.2-1 BiocParallel_1.41.0
## [28] parallel_4.5.0 R6_2.5.1 bslib_0.8.0
## [31] stringi_1.8.4 RColorBrewer_1.1-3 parallelly_1.38.0
## [34] boot_1.3-31 rpart_4.1.23 lubridate_1.9.3
## [37] jquerylib_0.1.4 Rcpp_1.0.13 bookdown_0.41
## [40] iterators_1.0.14 knitr_1.48 future.apply_1.11.3
## [43] zoo_1.8-12 Matrix_1.7-1 splines_4.5.0
## [46] nnet_7.3-19 timechange_0.3.0 tidyselect_1.2.1
## [49] abind_1.4-8 yaml_2.3.10 timeDate_4041.110
## [52] doParallel_1.0.17 codetools_0.2-20 listenv_0.9.1
## [55] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [58] withr_3.0.2 evaluate_1.0.1 future_1.34.0
## [61] survival_3.7-0 pillar_1.9.0 BiocManager_1.30.25
## [64] foreach_1.5.2 stats4_4.5.0 generics_0.1.3
## [67] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [70] minqa_1.2.8 globals_0.16.3 class_7.3-22
## [73] glue_1.8.0 pheatmap_1.0.12 tools_4.5.0
## [76] data.table_1.16.2 lme4_1.1-35.5 ModelMetrics_1.2.2.2
## [79] gower_1.0.1 cowplot_1.1.3 grid_4.5.0
## [82] bigmemory_4.6.4 tidyr_1.3.1 ipred_0.9-15
## [85] colorspace_2.1-1 nlme_3.1-166 cli_3.6.3
## [88] bigmemory.sri_0.1.8 fansi_1.0.6 lava_1.8.0
## [91] dplyr_1.1.4 strucchange_1.5-4 gtable_0.3.6
## [94] logging_0.10-108 sass_0.4.9 digest_0.6.37
## [97] caret_6.0-94 ggrepel_0.9.6 farver_2.1.2
## [100] htmltools_0.5.8.1 lifecycle_1.0.4 factoextra_1.0.7
## [103] hardhat_1.4.0 MASS_7.3-61