1 Introduction

netSmooth implements a network-smoothing framework to smooth single-cell gene expression data as well as other omics datasets. The algorithm is a graph-based diffusion process on networks. The intuition behind the algorithm is that gene networks encoding coexpression patterns may be used to smooth scRNA-seq expression data, since the gene expression values of connected nodes in the network will be predictive of each other. Protein-protein interaction (PPI) networks and coexpression networks are among the networks that could be used for such a procedure.

More precisely, netSmooth works as follows. First, the gene expression values (or other quantitative values per gene) from each sample are projected onto the provided network. Then, the diffusion process is used to smooth the expression values of adjacent genes in the graph, so that a gene's smoothed expression value represents an estimate of its expression level based on the gene itself as well as on the expression values of its neighbors in the graph. The rate at which expression values of genes diffuse to their neighbors is degree-normalized, so that genes with many edges will affect their neighbors less than genes with more specific interactions. The implementation has one free parameter, alpha, which controls whether the diffusion stays local or reaches further in the graph: the higher the value, the further the diffusion will reach. The netSmooth package implements strategies to optimize the value of alpha.
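The diffusion idea can be illustrated with a minimal base-R sketch on a toy three-gene network. It assumes a closed-form random-walk-with-restart kernel, (1 - alpha) * (I - alpha * A_norm)^-1, with a column-normalized adjacency matrix; this form, the toy network, the toy values, and the helper name smooth_toy are illustrative assumptions and are not taken from the package code.

```r
# Toy adjacency matrix for a three-gene chain: g1 - g2 - g3 (hypothetical network).
adj <- matrix(c(0, 1, 0,
                1, 0, 1,
                0, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("g1", "g2", "g3")))

# Toy genes-by-samples matrix: only g1 is detected in the single sample.
expr <- matrix(c(10, 0, 0), ncol = 1,
               dimnames = list(c("g1", "g2", "g3"), "cell1"))

# Hypothetical helper: random walk with restart in closed form (an assumption).
smooth_toy <- function(expr, adj, alpha) {
  # Divide each column by the degree of its gene, so that genes with many
  # edges pass a smaller share of their signal to each neighbor.
  adj_norm <- sweep(adj, 2, colSums(adj), "/")
  kernel <- (1 - alpha) * solve(diag(nrow(adj)) - alpha * adj_norm)
  out <- kernel %*% expr
  dimnames(out) <- dimnames(expr)
  out
}

smoothed_local <- smooth_toy(expr, adj, alpha = 0.2)  # diffusion stays near g1
smoothed_far   <- smooth_toy(expr, adj, alpha = 0.8)  # more signal reaches g3
```

In this sketch, a larger alpha moves more of the signal from g1 to the distant gene g3, while the total signal per sample is preserved because every column of the normalized adjacency matrix sums to one.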

Figure 1: Network-smoothing concept

In summary, netSmooth enables users to smooth quantitative values associated with genes using a gene interaction network such as a protein-protein interaction network. The following sections of this vignette demonstrate the functionality of the netSmooth package.

2 Smoothing single-cell gene expression data with netSmooth() function

The workhorse of the netSmooth package is the netSmooth() function. This function takes at least two arguments, a genes-by-samples matrix and a network, and performs smoothing on the genes-by-samples matrix. The network should be organized as an adjacency matrix, and its row and column names should match the row names of the genes-by-samples matrix.
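Before smoothing, it can be useful to check that the two inputs line up by name. The following base-R sketch does this on toy objects; toy_net and toy_expr are hypothetical stand-ins for a real network and expression matrix.

```r
genes <- c("geneA", "geneB", "geneC")

# Hypothetical adjacency matrix with matching row and column names.
toy_net <- matrix(0, nrow = 3, ncol = 3, dimnames = list(genes, genes))
toy_net["geneA", "geneB"] <- toy_net["geneB", "geneA"] <- 1
toy_net["geneB", "geneC"] <- toy_net["geneC", "geneB"] <- 1

# Hypothetical genes-by-samples matrix with the same gene names as rows.
toy_expr <- matrix(1:6, nrow = 3,
                   dimnames = list(genes, c("cell1", "cell2")))

# The network should be square, with identical row and column names.
net_ok <- nrow(toy_net) == ncol(toy_net) &&
  identical(rownames(toy_net), colnames(toy_net))

# Fraction of expression-matrix genes that can be found in the network.
frac_in_net <- mean(rownames(toy_expr) %in% rownames(toy_net))
```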

We will demonstrate the usage of the netSmooth() function using a subset of a human PPI network and a subset of single-cell RNA-seq data from GSE44183-GPL11154. We will first load the example datasets that are available through the netSmooth package.

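library(netSmooth)  # also attaches scater and clusterExperiment
library(pheatmap)   # used for the heatmaps below
data(smallPPI)
data(smallscRNAseq)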

We can now smooth the gene expression data with the netSmooth() function. We will use alpha=0.5.

smallscRNAseq.sm.se <- netSmooth(smallscRNAseq, smallPPI, alpha=0.5)
## Using given alpha: 0.5
smallscRNAseq.sm.sce <- SingleCellExperiment(
    assays=list(counts=assay(smallscRNAseq.sm.se)),
    colData=colData(smallscRNAseq.sm.se)
)

Now, we can look at the smoothed and raw expression values using a heatmap.

anno.df <- data.frame(cell.type=colData(smallscRNAseq)$source_name_ch1)
rownames(anno.df) <- colnames(smallscRNAseq)
pheatmap(log2(assay(smallscRNAseq)+1), annotation_col = anno.df,
         show_rownames = FALSE, show_colnames = FALSE,
         main="before netSmooth")

pheatmap(log2(assay(smallscRNAseq.sm.sce)+1), annotation_col = anno.df,
         show_rownames = FALSE, show_colnames = FALSE,
         main="after netSmooth")

2.1 Optimizing the smoothing parameter alpha

By default, the parameter alpha will be optimized using a robust clustering statistic. Briefly, this approach will try different clustering algorithms and/or parameters and find clusters that can be reproduced with different algorithms. The netSmooth() function will try different alpha values controlled by additional arguments to maximize the number of samples in robust clusters.
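The general shape of such a search can be sketched in base R as a grid search: smooth with each candidate alpha, score the result, and keep the best value. Everything below is an illustrative assumption: the toy data, the closed-form smoothing helper, and in particular the score, which is a simple stand-in (agreement between two hierarchical clusterings) and not the robust clustering statistic used by the package.

```r
set.seed(1)
n_genes <- 6
genes <- paste0("g", seq_len(n_genes))

# Hypothetical ring network: each gene is linked to its two neighbors.
adj <- matrix(0, n_genes, n_genes, dimnames = list(genes, genes))
for (i in seq_len(n_genes)) {
  j <- i %% n_genes + 1
  adj[i, j] <- adj[j, i] <- 1
}

# Toy genes-by-samples matrix with 8 samples.
expr <- matrix(rpois(n_genes * 8, lambda = 5), nrow = n_genes,
               dimnames = list(genes, paste0("cell", 1:8)))

# Assumed closed-form diffusion with a column-normalized adjacency matrix.
smooth_toy <- function(alpha) {
  adj_norm <- sweep(adj, 2, colSums(adj), "/")
  (1 - alpha) * solve(diag(n_genes) - alpha * adj_norm) %*% expr
}

# Stand-in score: fraction of sample pairs on which two linkage methods agree
# (together in both or apart in both) when each tree is cut into two clusters.
agreement_score <- function(mat) {
  d <- dist(t(mat))
  a <- cutree(hclust(d, method = "complete"), k = 2)
  b <- cutree(hclust(d, method = "average"), k = 2)
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  mean((same_a == same_b)[upper.tri(same_a)])
}

# Grid search: evaluate every candidate alpha and keep the highest score.
alphas <- seq(0.1, 0.9, by = 0.2)
scores <- vapply(alphas, function(a) agreement_score(smooth_toy(a)), numeric(1))
best_alpha <- alphas[which.max(scores)]
```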

Now, we smooth the expression values using automated alpha optimization and plot a heatmap of the smoothed data.

smallscRNAseq.sm.se <- netSmooth(smallscRNAseq, smallPPI, alpha='auto')
smallscRNAseq.sm.sce <- SingleCellExperiment(
    assays=list(counts=assay(smallscRNAseq.sm.se)),
    colData=colData(smallscRNAseq.sm.se)
)

pheatmap(log2(assay(smallscRNAseq.sm.sce)+1), annotation_col = anno.df,
         show_rownames = FALSE, show_colnames = FALSE,
         main="after netSmooth (optimal alpha)")

3 Getting robust clusters from data

There is no standard method for clustering single-cell RNA-seq data, as different studies produce data with different topologies, which respond differently to the various clustering algorithms. In order to avoid optimizing different clustering routines for the different datasets, we have implemented a robust clustering routine based on clusterExperiment. The clusterExperiment framework for robust clustering is based on consensus clustering of clustering assignments obtained from different views of the data and different clustering algorithms. The different views are different reduced-dimensionality projections of the data based on different techniques; thus, no single clustering result will dominate the data, and only cluster structures which are robust to different analyses will prevail. We implemented a clustering framework using the components of clusterExperiment and different dimensionality reduction methods.
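The consensus idea can be sketched in base R as follows. This is not the clusterExperiment implementation: the toy data, the choice of hierarchical clustering, and the parameter names proportion and min_size are assumptions made for illustration. Samples are grouped only if they share a cluster in at least the given proportion of runs, and groups that are too small are marked as unassigned (-1).

```r
set.seed(42)
# Toy data (samples in rows): two tight groups of five samples and one
# sample lying between them.
x <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2),
           c(2.5, 2.5))
rownames(x) <- paste0("cell", seq_len(nrow(x)))

# Several clusterings of the same samples: three linkage methods, k = 2 and 3.
d <- dist(x)
settings <- expand.grid(method = c("complete", "average", "single"),
                        k = 2:3, stringsAsFactors = FALSE)
labels <- mapply(function(method, k) cutree(hclust(d, method = method), k = k),
                 settings$method, settings$k)

# Co-clustering proportion: how often each pair of samples shares a cluster.
co <- Reduce(`+`, lapply(seq_len(ncol(labels)), function(i) {
  outer(labels[, i], labels[, i], "==")
})) / ncol(labels)

# Consensus: complete linkage on 1 - co, cut so that every pair inside a
# group co-clusters in at least `proportion` of the runs.
proportion <- 0.9
min_size <- 2
groups <- cutree(hclust(as.dist(1 - co), method = "complete"),
                 h = 1 - proportion)

# Groups smaller than `min_size` are marked as unassigned (-1).
sizes <- tabulate(groups)
consensus <- ifelse(sizes[groups] >= min_size, groups, -1)
names(consensus) <- names(groups)
```

In this toy example the in-between sample joins different groups in different runs, so it does not reach the required proportion with any group and ends up unassigned.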

We can directly use the robust clustering function robustClusters.

yhat <- robustClusters(smallscRNAseq, makeConsensusMinSize=2, makeConsensusProportion=.9)$clusters
## Picked dimReduceFlavor: tsne
## 6 parameter combinations, 0 use sequential method, 0 use subsampling method
## Running Clustering on Parameter Combinations...
## done.
## Note: Merging will be done on ' makeConsensus ', with clustering index 1
yhat.sm <- robustClusters(smallscRNAseq.sm.se, makeConsensusMinSize=2, makeConsensusProportion=.9)$clusters
## Picked dimReduceFlavor: pca
## 6 parameter combinations, 0 use sequential method, 0 use subsampling method
## Running Clustering on Parameter Combinations...
## done.
## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.
## Note: Merging will be done on ' makeConsensus ', with clustering index 1
cell.types <- colData(smallscRNAseq)$source_name_ch1
knitr::kable(
  table(cell.types, yhat), caption = 'Cell types and `robustClusters` in the raw data.'
)

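Table 1: Cell types and robustClusters in the raw data.

| Cell type         | -1 | 1 | 2 |
|:------------------|---:|--:|--:|
| 2-cell blastomere |  0 | 3 | 0 |
| 4-cell blastomere |  1 | 3 | 0 |
| 8-cell blastomere |  6 | 0 | 4 |
| 8-cell embryo     |  1 | 0 | 0 |
| morula            |  0 | 0 | 3 |
| oocyte            |  0 | 3 | 0 |
| pronucleus        |  0 | 3 | 0 |
| zygote            |  0 | 2 | 0 |
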
knitr::kable(
  table(cell.types, yhat.sm), caption = 'Cell types and `robustClusters` in the smoothed data.'
)

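Table 2: Cell types and robustClusters in the smoothed data.

| Cell type         | -1 | 1 | 2 | 3 | 4 |
|:------------------|---:|--:|--:|--:|--:|
| 2-cell blastomere |  1 | 0 | 2 | 0 | 0 |
| 4-cell blastomere |  3 | 0 | 1 | 0 | 0 |
| 8-cell blastomere |  6 | 0 | 0 | 2 | 2 |
| 8-cell embryo     |  1 | 0 | 0 | 0 | 0 |
| morula            |  0 | 3 | 0 | 0 | 0 |
| oocyte            |  0 | 0 | 3 | 0 | 0 |
| pronucleus        |  0 | 0 | 3 | 0 | 0 |
| zygote            |  0 | 0 | 2 | 0 | 0 |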

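A cluster assignment of -1 indicates that the cell could not be placed in a robust cluster and has consequently been omitted. In the raw data, the two robust clusters separate the early-stage cells (cluster 1) from the later-stage cells (cluster 2), but the clustered 8-cell blastomeres and the morula cells share a cluster. In the smoothed data, the robustClusters procedure places the morula cells in a cluster of their own (cluster 1) and the clustered 8-cell blastomeres in separate clusters (clusters 3 and 4).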
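Cells labeled -1 usually need to be set aside before clusters are compared or summarized. A minimal sketch with a hypothetical assignment vector (toy_clusters is made up for illustration and is not output from the package):

```r
# Hypothetical cluster assignments; -1 marks cells without a robust cluster.
toy_clusters <- c(cell1 = 1, cell2 = -1, cell3 = 2, cell4 = 2, cell5 = -1)

# Logical mask of cells that were placed in a robust cluster.
assigned <- toy_clusters != -1

# Names of the clustered cells and the proportion of cells that were assigned.
clustered_cells <- names(toy_clusters)[assigned]
prop_assigned <- mean(assigned)
```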

4 Choosing the best dimensionality reduction method for visualization and clustering

The robustClusters() function works by clustering samples in a lower-dimensional embedding using either PCA or t-SNE. Different single-cell datasets might respond better to different dimensionality reduction techniques. In order to pick the right technique algorithmically, we compute the entropy in a 2D embedding. We obtain 2D embeddings from the 500 most variable genes using either PCA or t-SNE, bin them in a 20x20 grid, and compute the entropy. The entropy in the 2D embedding is a measure of the information captured by it. We pick the embedding with the highest information content. The pickDimReduction() function implements this strategy and returns the name of the best embedding.
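The entropy computation can be sketched in base R for a single PCA embedding. The toy data and the helper name embedding_entropy are assumptions, as is the use of Shannon entropy in bits over equal-width bins; the package's own implementation may differ in these details. Comparing methods would mean computing the same quantity for a t-SNE embedding and keeping the larger value.

```r
set.seed(7)
# Toy genes-by-samples matrix of log-scale values (600 genes, 40 samples).
expr <- matrix(rnorm(600 * 40), nrow = 600)

# Keep the 500 most variable genes.
gene_var <- apply(expr, 1, var)
top <- expr[order(gene_var, decreasing = TRUE)[1:500], ]

# 2D embedding of the samples via PCA (prcomp expects samples in rows).
emb <- prcomp(t(top))$x[, 1:2]

# Hypothetical helper: bin the embedding into a bins x bins grid and compute
# the Shannon entropy (in bits) of the occupancy distribution.
embedding_entropy <- function(emb, bins = 20) {
  bx <- cut(emb[, 1], breaks = bins)
  by <- cut(emb[, 2], breaks = bins)
  counts <- table(bx, by)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

h_pca <- embedding_entropy(emb)
```

With 40 samples, the entropy is bounded above by log2(40), which is reached only when every sample falls into its own grid cell.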

Below, we pick the best embedding for our example dataset and plot scatter plots for different 2D embedding methods.

smallscRNAseq <- runPCA(smallscRNAseq, ncomponents=2)
smallscRNAseq <- runTSNE(smallscRNAseq, ncomponents=2)
smallscRNAseq <- runUMAP(smallscRNAseq, ncomponents=2)

plotPCA(smallscRNAseq, colour_by='source_name_ch1') + ggtitle("PCA plot")

plotTSNE(smallscRNAseq, colour_by='source_name_ch1') + ggtitle("tSNE plot")

plotUMAP(smallscRNAseq, colour_by='source_name_ch1') + ggtitle("UMAP plot")

The pickDimReduction method picks the dimensionality reduction method which produces the highest entropy embedding:

pickDimReduction(smallscRNAseq)
## [1] "tsne"

5 Frequently asked questions

5.0.1 How can I make smoothing faster?

Make sure that your R installation is linked against OpenBLAS or another optimized BLAS implementation.
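To see which BLAS library the current R session uses, and to get a rough feel for linear algebra speed, something like the following can be run. extSoftVersion() and La_library() are base R functions; the matrix size is an arbitrary choice for illustration.

```r
# Report the BLAS and LAPACK libraries this R session is linked against.
blas_path <- extSoftVersion()[["BLAS"]]
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")

# Rough timing of a dense matrix product and a matrix inversion, the kind of
# operations that benefit from an optimized BLAS.
set.seed(1)
m <- matrix(rnorm(500 * 500), nrow = 500)
elapsed <- system.time({
  prod_m <- m %*% m
  inv_m <- solve(m)
})[["elapsed"]]
cat("Elapsed seconds:", elapsed, "\n")
```

Running the same snippet before and after switching BLAS libraries gives a simple point of comparison.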

5.0.2 What happens if not all of my genes are in the network?

Smoothing is performed only on the genes that are present in the network. The genes that are not in the network are left unsmoothed and are then attached to the smoothed gene expression matrix.
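This behavior can be mimicked in base R on toy data. The sketch assumes a closed-form random-walk-with-restart kernel with a column-normalized adjacency matrix; the network, the expression values, and all variable names are hypothetical and are not taken from the package code.

```r
genes_in_net <- c("g1", "g2", "g3")
adj <- matrix(c(0, 1, 1,
                1, 0, 0,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(genes_in_net, genes_in_net))

# Toy expression matrix with one gene (g4) that is missing from the network.
expr <- matrix(c(4, 0, 2, 7,
                 1, 3, 0, 5),
               ncol = 2,
               dimnames = list(c("g1", "g2", "g3", "g4"), c("cell1", "cell2")))

shared <- intersect(rownames(expr), rownames(adj))
missing_genes <- setdiff(rownames(expr), rownames(adj))

# Smooth only the genes that are present in the network (assumed kernel).
alpha <- 0.5
sub_adj <- adj[shared, shared]
adj_norm <- sweep(sub_adj, 2, colSums(sub_adj), "/")
kernel <- (1 - alpha) * solve(diag(length(shared)) - alpha * adj_norm)
smoothed_shared <- kernel %*% expr[shared, , drop = FALSE]
rownames(smoothed_shared) <- shared

# Attach the unsmoothed rows and restore the original gene order.
combined <- rbind(smoothed_shared, expr[missing_genes, , drop = FALSE])
combined <- combined[rownames(expr), , drop = FALSE]
```

The row for g4 in combined is identical to its row in expr, while the rows for g1, g2, and g3 contain smoothed values.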


sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             netSmooth_1.27.0           
##  [3] clusterExperiment_2.27.0    bigmemory_4.6.4            
##  [5] scater_1.35.0               ggplot2_3.5.1              
##  [7] scuttle_1.17.0              SingleCellExperiment_1.29.0
##  [9] SummarizedExperiment_1.37.0 Biobase_2.67.0             
## [11] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
## [13] IRanges_2.41.0              S4Vectors_0.45.0           
## [15] BiocGenerics_0.53.0         MatrixGenerics_1.19.0      
## [17] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.0           tibble_3.2.1            XML_3.99-0.17          
##   [4] lifecycle_1.0.4         edgeR_4.5.0             doParallel_1.0.17      
##   [7] lattice_0.22-6          MASS_7.3-61             MAST_1.33.0            
##  [10] magrittr_2.0.3          limma_3.63.0            sass_0.4.9             
##  [13] rmarkdown_2.28          jquerylib_0.1.4         yaml_2.3.10            
##  [16] NMF_0.28                cowplot_1.1.3           zinbwave_1.29.0        
##  [19] DBI_1.2.3               RColorBrewer_1.1-3      ade4_1.7-22            
##  [22] abind_1.4-8             zlibbioc_1.53.0         Rtsne_0.17             
##  [25] purrr_1.0.2             GenomeInfoDbData_1.2.13 ggrepel_0.9.6          
##  [28] irlba_2.3.5.1           genefilter_1.89.0       annotate_1.85.0        
##  [31] codetools_0.2-20        DelayedArray_0.33.0     xml2_1.3.6             
##  [34] tidyselect_1.2.1        RNeXML_2.4.11           locfdr_1.1-8           
##  [37] UCSC.utils_1.3.0        farver_2.1.2            ScaledMatrix_1.15.0    
##  [40] viridis_0.6.5           jsonlite_1.8.9          BiocNeighbors_2.1.0    
##  [43] phylobase_0.8.12        survival_3.7-0          iterators_1.0.14       
##  [46] foreach_1.5.2           tools_4.5.0             progress_1.2.3         
##  [49] Rcpp_1.0.13             glue_1.8.0              gridExtra_2.3          
##  [52] SparseArray_1.7.0       xfun_0.48               dplyr_1.1.4            
##  [55] HDF5Array_1.35.0        withr_3.0.2             BiocManager_1.30.25    
##  [58] fastmap_1.2.0           rhdf5filters_1.19.0     fansi_1.0.6            
##  [61] entropy_1.3.1           digest_0.6.37           rsvd_1.0.5             
##  [64] R6_2.5.1                colorspace_2.1-1        RSQLite_2.3.7          
##  [67] utf8_1.2.4              tidyr_1.3.1             generics_0.1.3         
##  [70] data.table_1.16.2       FNN_1.1.4.1             prettyunits_1.2.0      
##  [73] httr_1.4.7              S4Arrays_1.7.0          uwot_0.2.2             
##  [76] pkgconfig_2.0.3         gtable_0.3.6            blob_1.2.4             
##  [79] registry_0.5-1          XVector_0.47.0          htmltools_0.5.8.1      
##  [82] bookdown_0.41           scales_1.3.0            png_0.1-8              
##  [85] bigmemory.sri_0.1.8     knitr_1.48              reshape2_1.4.4         
##  [88] rncl_0.8.7              uuid_1.2-1              nlme_3.1-166           
##  [91] cachem_1.1.0            rhdf5_2.51.0            stringr_1.5.1          
##  [94] parallel_4.5.0          vipor_0.4.7             softImpute_1.4-1       
##  [97] AnnotationDbi_1.69.0    pillar_1.9.0            grid_4.5.0             
## [100] vctrs_0.6.5             BiocSingular_1.23.0     beachmat_2.23.0        
## [103] xtable_1.8-4            cluster_2.1.6           beeswarm_0.4.0         
## [106] evaluate_1.0.1          tinytex_0.53            magick_2.8.5           
## [109] cli_3.6.3               locfit_1.5-9.10         compiler_4.5.0         
## [112] rlang_1.1.4             crayon_1.5.3            rngtools_1.5.2         
## [115] labeling_0.4.3          plyr_1.8.9              ggbeeswarm_0.7.2       
## [118] stringi_1.8.4           viridisLite_0.4.2       gridBase_0.4-7         
## [121] BiocParallel_1.41.0     munsell_0.5.1           Biostrings_2.75.0      
## [124] Matrix_1.7-1            hms_1.1.3               bit64_4.5.2            
## [127] Rhdf5lib_1.29.0         KEGGREST_1.47.0         statmod_1.5.0          
## [130] highr_0.11              kernlab_0.9-33          memoise_2.0.1          
## [133] bslib_0.8.0             bit_4.5.0               ape_5.8