chipseqDBData 1.21.0
This package contains several ChIP-seq datasets for use in differential binding (DB) analyses:
These datasets are mainly used in the chipseqDB workflow (Lun and Smyth 2015) and the csaw user’s guide (Lun and Smyth 2016). This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.
We obtain the H3K9ac dataset from ExperimentHub using the H3K9acData()
function.
This downloads sorted and indexed BAM files to a local cache, along with the associated index files.
The function returns a DataFrame
of file paths and sample descriptions to further use in workflows.
library(chipseqDBData)
h3k9ac.paths <- H3K9acData()
h3k9ac.paths
## DataFrame with 4 rows and 3 columns
## Name Description Path
## <character> <character> <List>
## 1 h3k9ac-proB-8113 pro-B H3K9ac (8113) <BamFile>
## 2 h3k9ac-proB-8108 pro-B H3K9ac (8108) <BamFile>
## 3 h3k9ac-matureB-8059 mature B H3K9ac (8059) <BamFile>
## 4 h3k9ac-matureB-8086 mature B H3K9ac (8086) <BamFile>
Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)
The same approach is used for all of the other datasets, e.g., CBPData()
, NFYAData()
.
Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.
We use functions from the Rsamtools package to examine the mapping statistics. This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores.
library(Rsamtools)
diagnostics <- list()
for (i in seq_len(nrow(h3k9ac.paths))) {
stats <- scanBam(h3k9ac.paths$Path[[i]],
param=ScanBamParam(what=c("mapq", "flag")))
flag <- stats[[1]]$flag
mapq <- stats[[1]]$mapq
mapped <- bitwAnd(flag, 0x4)==0
diagnostics[[h3k9ac.paths$Name[i]]] <- c(
Total=length(flag),
Mapped=sum(mapped),
HighQual=sum(mapq >= 10 & mapped),
DupMarked=sum(bitwAnd(flag, 0x400)!=0)
)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100
diag.stats
## Total Mapped HighQual DupMarked Prop.mapped Prop.marked
## h3k9ac-proB-8113 10724526 8832006 8815503 434884 82.35335 4.923955
## h3k9ac-proB-8108 10413135 7793913 7786335 252271 74.84694 3.236770
## h3k9ac-matureB-8059 16675372 4670364 4568908 396785 28.00756 8.495805
## h3k9ac-matureB-8086 6347683 4551692 4535587 141583 71.70635 3.110558
More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as ChIPQC.
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Rsamtools_2.21.2 Biostrings_2.73.2 XVector_0.45.0
## [4] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2 IRanges_2.39.2
## [7] S4Vectors_0.43.2 BiocGenerics_0.51.3 chipseqDBData_1.21.0
## [10] knitr_1.48 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.45.1 xfun_0.48 bslib_0.8.0
## [4] Biobase_2.65.1 vctrs_0.6.5 tools_4.5.0
## [7] bitops_1.0-9 generics_0.1.3 parallel_4.5.0
## [10] curl_5.2.3 tibble_3.2.1 fansi_1.0.6
## [13] AnnotationDbi_1.67.0 RSQLite_2.3.7 blob_1.2.4
## [16] pkgconfig_2.0.3 dbplyr_2.5.0 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.13 compiler_4.5.0 codetools_0.2-20
## [22] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [25] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
## [28] BiocParallel_1.39.0 cachem_1.1.0 mime_0.12
## [31] ExperimentHub_2.13.1 AnnotationHub_3.13.3 tidyselect_1.2.1
## [34] digest_0.6.37 purrr_1.0.2 dplyr_1.1.4
## [37] bookdown_0.41 BiocVersion_3.20.0 fastmap_1.2.0
## [40] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
## [43] withr_3.0.1 filelock_1.0.3 UCSC.utils_1.1.0
## [46] rappdirs_0.3.3 bit64_4.5.2 rmarkdown_2.28
## [49] httr_1.4.7 bit_4.5.0 png_0.1-8
## [52] memoise_2.0.1 evaluate_1.0.1 BiocFileCache_2.13.2
## [55] rlang_1.1.4 glue_1.8.0 DBI_1.2.3
## [58] BiocManager_1.30.25 jsonlite_1.8.9 R6_2.5.1
## [61] zlibbioc_1.51.2