chromVAR is an R package for the analysis of sparse chromatin accessibility. chromVAR takes as inputs aligned fragments (filtered for duplicates and low quality) from ATAC-seq or DNAse-seq experiments as well as genomic annotations such as motif positions. chromVAR computes for each annotation and each cell or sample a bias corrected “deviation” in accessibility from the expected accessibility based on the average of all the cells or samples.
This vignette covers basic usage of chromVAR with standard inputs. For more detailed documentation covering different options for various inputs and additional applications, view the additional vignettes on the documentation website. For more description of the method and applications, please see the publication (pdf, supplement).
##Loading the package
Use library or require to load the package and useful additional packages.
library(chromVAR)
library(motifmatchr)
library(Matrix)
library(SummarizedExperiment)
library(BiocParallel)
set.seed(2017)
##Setting multiprocessing options The package uses BiocParallel to do the multiprocessing. Check the documentation for BiocParallel to see available options. The settings can be set using the register function. For example, to use MulticoreParam with 8 cores:
## Warning: 'IS_BIOC_BUILD_MACHINE' environment variable detected, setting
## BiocParallel workers to 4 (was 8)
To enable progress bars for multiprocessed tasks, use
## Warning: 'IS_BIOC_BUILD_MACHINE' environment variable detected, setting
## BiocParallel workers to 4 (was 8)
For Windows, MulticoreParam
will not work, but you can
use SnowParam:
Even if you don’t want to use more than one core, it is best to explicitly register that choice using SerialParam:
Please see the documentation for BiocParallel
for more information about the register
function and the
various options for multi-processing.
##Reading in inputs
chromVAR takes as input a table of counts of fragments falling in open chromatin peaks. There are numerous software packages that enable the creation of count tables from epigenomics data; chromVAR also provides a method that is tailored for single-cell ATAC-seq counts. To learn more about the options for the count table and how to format a count table computed via other software, see the Counts section of the documentation website.
###Peaks
For using chromVAR, it is recommended to use fixed-width, non-overlapping peaks. The method is fairly robust to the exact choice of peak width, but a width of 250 to 500 bp is recommended. See the supplement of the paper for a disussion section and supplementary figures related to the choice of peak width.
If analyzing single cell data, it can make sense to use peaks derived from bulk ATAC or DNAse-seq data, either from the same population or a similar population (or possibly from a public resource, like the Roadmap Epigenomics project).
If combining multiple peak files from different populations, it is
recommended to combine the peaks together. The filterPeaks
function (demonstrated a bit further down this vignette) will reduce the
peaks to a non-overlapping set based on which overlapping peak has
stronger signal across all the data.
The function getPeaks
reads in the peaks as a
GenomicRanges object. We will show its use by reading in a tiny sample
bed file. We’ll use the sort_peaks
argument to indicate we
want to sort the peaks.
peakfile <- system.file("extdata/test_bed.txt", package = "chromVAR")
peaks <- getPeaks(peakfile, sort_peaks = TRUE)
## Peaks sorted
For reading in peaks in the narrowpeak format, chromVAR includes a
function, readNarrowpeaks
, that will read in the file,
resize the peaks to a given size based on the width
argument, and remove peaks that overlap a peak with stronger signal (if
non_overlapping
is set to TRUE – the default).
###Counts
The function getCounts
returns a chromVARCounts object
with a Matrix of fragment counts per sample/cell for each peak in
assays. This data can be accessed with
counts(fragment_counts)
.The Matrix package is used so that
if the matrix is sparse, the matrix will be stored as a sparse Matrix.
We will demonstrate this function with a very tiny set of reads included
in the package:
bamfile <- system.file("extdata/test_RG.bam", package = "chromVAR")
fragment_counts <- getCounts(bamfile,
peaks,
paired = TRUE,
by_rg = TRUE,
format = "bam",
colData = DataFrame(celltype = "GM"))
## Reading in file: /tmp/RtmpIMh5Gx/Rinst8be54d24dfcf/chromVAR/extdata/test_RG.bam
Here we passed only one bam file, but we can also pass a vector of
bam file names. In that case, for the column data we would specify the
appropriate value per file, e.g
DataFrame(celltype = c("GM","K562"))
if we were passing in
one file for GM cells and one for K562 cells.
If RG tags are not used for combining multiple samples within a file,
use by_rg = FALSE
. For more on reading in counts, see the
Counts
section of the documentation website.
For the rest of the vignette, we will use a very small (but slightly larger than the previous example) data set of 10 GM cells and 10 H1 cells that has already been read in as a SummarizedExperiment object.
## class: RangedSummarizedExperiment
## dim: 6 50
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(3): score qval name
## colnames(50): singles-GM12878-140905-1 singles-GM12878-140905-2 ...
## singles-H1ESC-140820-24 singles-H1ESC-140820-25
## colData names(2): Cell_Type depth
##Getting GC content of peaks
The GC content will be used for determining background peaks. The
function addGCBias
returns an updated SummarizedExperiment
with a new rowData column named “bias”. The function requires an input
of a genome sequence, which can be provided as a BSgenome, FaFile, or
DNAStringSet object.
library(BSgenome.Hsapiens.UCSC.hg19)
example_counts <- addGCBias(example_counts,
genome = BSgenome.Hsapiens.UCSC.hg19)
head(rowData(example_counts))
## DataFrame with 6 rows and 4 columns
## score qval name bias
## <integer> <numeric> <character> <numeric>
## 1 259 25.99629 GM_peak_6 0.652
## 2 82 8.21494 H1_peak_7 0.680
## 3 156 15.65456 GM_peak_17 0.788
## 4 82 8.21494 H1_peak_13 0.674
## 5 140 14.03065 GM_peak_27 0.600
## 6 189 18.99529 GM_peak_29a 0.742
Check out available.genomes
from the BSgenome package
for what genomes are available as BSgenome packages. For making your own
BSgenome object, check out BSgenomeForge
.
##Filtering inputs
If working with single cell data, it is advisable to filter out samples with insufficient reads or a low proportion of reads in peaks as these may represent empty wells or dead cells. Two parameters are used for filtering – min_in_peaks and min_depth. If not provided (as above), these cutoffs are estimated based on the medians from the data. min_in_peaks is set to 0.5 times the median proportion of fragments in peaks. min_depth is set to the maximum of 500 or 10% of the median library size.
Unless plot = FALSE
given as argument to function
filterSamples
, a plot will be generated.
#find indices of samples to keep
counts_filtered <- filterSamples(example_counts, min_depth = 1500,
min_in_peaks = 0.15, shiny = FALSE)
If shiny argument is set to TRUE (the default), a shiny gadget will pop up which allows you to play with the filtering parameters and see which cells pass filters or not.
To get just the plot of what is filtered, use
filterSamplesPlot
. By default, the plot is interactive– to
set it as not interactive use use_plotly = FALSE
.
#find indices of samples to keep
filtering_plot <- filterSamplesPlot(example_counts, min_depth = 1500,
min_in_peaks = 0.15, use_plotly = FALSE)
filtering_plot
To instead return the indexes of the samples to keep instead of a new SummarizedExperiment object, use ix_return = TRUE.
## min_in_peaks set to 0.105
## min_depth set to 1120.65
For both bulk and single cell data, peaks should be filtered based on
having at least a certain number of fragments. At minimum, each peak
should have at least one fragment across all the samples (it might be
possible to have peaks with zero reads due to using a peak set defined
by other data). Otherwise, downstream functions won’t work. The function
filterPeaks
will also reduce the peak set to
non-overlapping peaks (keeping the peak with higher counts for peaks
that overlap) if non_overlapping argument is set to TRUE (which is the
default).
The function getJasparMotifs
fetches motifs from the
JASPAR database.
The function getJasparMotifs() by default gets human motifs from JASPAR core database. For other species motifs, change the species argument.
For using a collection other than core, use the
collection
argument. Options include: “CORE”, “CNE”,
“PHYLOFACTS”, “SPLICE”, “POLII”, “FAM”, “PBM”, “PBM_HOMEO”,
“PBM_HLH”.
The getJasparMotifs
function is simply a wrapper around
getMatrixSet
from TFBSTools– you can also use that function
to fetch motifs from JASPAR if you prefer, and/or check out the
documentation for that function for more information.
The function matchMotifs
from the motifmatchr package
finds which peaks contain which motifs. By default, it returns a
SummarizedExperiment object, which contains a sparse matrix indicating
motif match or not.The function requires an input of a genome sequence,
which can be provided as a BSgenome, FaFile, or DNAStringSet object.
library(motifmatchr)
motif_ix <- matchMotifs(motifs, counts_filtered,
genome = BSgenome.Hsapiens.UCSC.hg19)
One option is the p.cutoff for determing how stringent motif calling should be. The default value is 0.00005, which tends to give reasonable numbers of motif matches for human motifs.
Instead of returning just motif matches, the function can also return
additional matrices (stored as assays) with the number of motif matches
per peak and the maximum motif score per peak. For this additional
information, use out = scores
. To return the actual
positions of motif matches, use out = positions
. Either the
output with out = matches
or out = scores
can
be passed to the computeDeviations function.
If instead of using known motifs, you want to use all kmers of a
certain length, the matchKmers
function can be used. For
more about using kmers as inputs, see the the Annotations
section of the documentation website.
The function computeDeviations
returns a
SummarizedExperiment with two “assays”. The first matrix (accessible via
deviations(dev)
or assays(dev)$deviations
)
will give the bias corrected “deviation” in accessibility for each set
of peaks (rows) for each cell or sample (columns). This metric represent
how accessible the set of peaks is relative to the expectation based on
equal chromatin accessibility profiles across cells/samples, normalized
by a set of background peak sets matched for GC and average
accessability. The second matrix (deviationScores(dev)
or
assays(deviations)$z
) gives the deviation Z-score, which
takes into account how likely such a score would occur if randomly
sampling sets of beaks with similar GC content and average
accessibility.
The function computeDeviations will use a set of background peaks for
normalizing the deviation scores. This computation is done internally by
default and not returned – to have greater control over this step, a
user can run the getBackgroundPeaks
function themselves and
pass the result to computeDeviations under the background_peaks
parameter.
Background peaks are peaks that are similar to a peak in GC content and average accessibility.
The result from getBackgroundPeaks
is a matrix of
indices, where each column represents the index of the peak that is a
background peak.
To use the background peaks computed, simply add those to the call to computeDeviations:
The function computeVariability
returns a data.frame
that contains the variability (standard deviation of the z scores
computed above across all cell/samples for a set of peaks), bootstrap
confidence intervals for that variability (by resampling cells/samples),
and a p-value for the variability being greater than the null hypothesis
of 1.
plotVariability
takes the output of
computeVariability
and returns a plot of rank sorted
annotation sets and their variability. By default, the plot will be
interactive, unless you set use_plotly = FALSE
.
For visualizing cells, it can be useful to project the deviation
values into two dimension using TSNE. A convenience function for doing
so is provided in deviationsTsne
. If running in an
interactive session, shiny can be set to TRUE to load up a shiny gadget
for exploring parameters.
To plot the results, plotDeviationsTsne
can be used. If
running in an interactive session or an interactive Rmarkdown document,
shiny can be set to TRUE to generate a shiny widget. Here we will show
static results.
## [1] "2024-10-29"
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.75.0
## [3] rtracklayer_1.67.0 BiocIO_1.17.0
## [5] Biostrings_2.75.0 XVector_0.47.0
## [7] BiocParallel_1.41.0 SummarizedExperiment_1.37.0
## [9] Biobase_2.67.0 GenomicRanges_1.59.0
## [11] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [13] S4Vectors_0.45.0 BiocGenerics_0.53.0
## [15] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [17] Matrix_1.7-1 motifmatchr_1.29.0
## [19] chromVAR_1.29.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.9
## [3] magrittr_2.0.3 farver_2.1.2
## [5] rmarkdown_2.28 zlibbioc_1.53.0
## [7] vctrs_0.6.5 memoise_2.0.1
## [9] Rsamtools_2.23.0 RCurl_1.98-1.16
## [11] htmltools_0.5.8.1 S4Arrays_1.7.0
## [13] curl_5.2.3 CNEr_1.43.0
## [15] SparseArray_1.7.0 sass_0.4.9
## [17] pracma_2.4.4 bslib_0.8.0
## [19] htmlwidgets_1.6.4 plyr_1.8.9
## [21] plotly_4.10.4 cachem_1.1.0
## [23] GenomicAlignments_1.43.0 mime_0.12
## [25] lifecycle_1.0.4 pkgconfig_2.0.3
## [27] R6_2.5.1 fastmap_1.2.0
## [29] GenomeInfoDbData_1.2.13 shiny_1.9.1
## [31] digest_0.6.37 colorspace_2.1-1
## [33] TFMPvalue_0.0.9 AnnotationDbi_1.69.0
## [35] RSQLite_2.3.7 seqLogo_1.73.0
## [37] labeling_0.4.3 fansi_1.0.6
## [39] httr_1.4.7 abind_1.4-8
## [41] compiler_4.5.0 bit64_4.5.2
## [43] withr_3.0.2 DBI_1.2.3
## [45] highr_0.11 R.utils_2.12.3
## [47] poweRlaw_0.80.0 DelayedArray_0.33.0
## [49] rjson_0.2.23 gtools_3.9.5
## [51] caTools_1.18.3 tools_4.5.0
## [53] httpuv_1.6.15 R.oo_1.26.0
## [55] glue_1.8.0 restfulr_0.0.15
## [57] promises_1.3.0 grid_4.5.0
## [59] Rtsne_0.17 reshape2_1.4.4
## [61] TFBSTools_1.45.0 generics_0.1.3
## [63] gtable_0.3.6 tzdb_0.4.0
## [65] R.methodsS3_1.8.2 tidyr_1.3.1
## [67] nabor_0.5.0 data.table_1.16.2
## [69] hms_1.1.3 utf8_1.2.4
## [71] pillar_1.9.0 stringr_1.5.1
## [73] vroom_1.6.5 later_1.3.2
## [75] dplyr_1.1.4 lattice_0.22-6
## [77] bit_4.5.0 annotate_1.85.0
## [79] tidyselect_1.2.1 DirichletMultinomial_1.49.0
## [81] GO.db_3.20.0 miniUI_0.1.1.1
## [83] knitr_1.48 xfun_0.48
## [85] DT_0.33 stringi_1.8.4
## [87] UCSC.utils_1.3.0 lazyeval_0.2.2
## [89] yaml_2.3.10 evaluate_1.0.1
## [91] codetools_0.2-20 archive_1.1.9
## [93] tibble_3.2.1 cli_3.6.3
## [95] xtable_1.8-4 munsell_0.5.1
## [97] jquerylib_0.1.4 Rcpp_1.0.13
## [99] png_0.1-8 XML_3.99-0.17
## [101] parallel_4.5.0 ggplot2_3.5.1
## [103] readr_2.1.5 blob_1.2.4
## [105] JASPAR2016_1.33.0 bitops_1.0-9
## [107] pwalign_1.3.0 viridisLite_0.4.2
## [109] scales_1.3.0 purrr_1.0.2
## [111] crayon_1.5.3 rlang_1.1.4
## [113] KEGGREST_1.47.0