1 Basics

1.1 Motivation

Some RNA-sequencing library preparations generate sequencing reads from specific locations in transcripts. For example, 3’-end tag methods, which include methods like Lexogen’s Quant-Seq or 10X’s Chromium Single Cell 3’, generate reads from the 3’ ends of transcripts. For applications interested in quantifying alternative cleavage site usage, using a truncated form of transcriptome annotation can help simplify downstream analysis and, in some pipelines, provide for more accurate estimates. This package implements methods to simplify the generation of such truncated annotations.

1.2 Install txcutr

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. txcutr is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install txcutr by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("txcutr")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

1.3 Required Background

txcutr is based on many other packages and in particular on those that have implemented the infrastructure needed for dealing with transcriptome annotations. That is, packages like GenomicRanges and GenomicFeatures. While we anticipate most use cases for txcutr to not require manipulating GRanges or TxDb objects directly, it would be beneficial to be familiar with the vignettes provided in these packages.

1.4 Asking for Help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. We would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the txcutr tag and check any previous posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

1.5 Citing txcutr

We hope that txcutr will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("txcutr")
#> To cite package 'txcutr' in publications use:
#> 
#>   Fansler M (2024). _txcutr: Transcriptome CUTteR_.
#>   doi:10.18129/B9.bioc.txcutr
#>   <https://doi.org/10.18129/B9.bioc.txcutr>, R package version 1.13.0,
#>   <https://bioconductor.org/packages/txcutr>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {txcutr: Transcriptome CUTteR},
#>     author = {Mervin Fansler},
#>     year = {2024},
#>     note = {R package version 1.13.0},
#>     url = {https://bioconductor.org/packages/txcutr},
#>     doi = {10.18129/B9.bioc.txcutr},
#>   }

2 Quick Start to Using txcutr

2.1 Starting from TxDb Objects

Many transcriptome annotations, especially for model organisms, are already directly available as TxDb annotation packages. One can browse a list of available TxDb objects on the Bioconductor website.

2.1.1 Example TxDb

For demonstration purposes, we will work with the SGD genes for the yeast Saccharomyces cerevisiae, and restrict the processing to chrI.

library(GenomicFeatures)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene

## constrain to "chrI"
seqlevels(txdb) <- "chrI"

## view the lengths of first ten transcripts
transcriptLengths(txdb)[1:10,]
#>    tx_id   tx_name   gene_id nexon tx_len
#> 1      1   YAL069W   YAL069W     1    315
#> 2      2 YAL068W-A   YAL069W     1    255
#> 3      3 YAL067W-A YAL067W-A     1    228
#> 4      4   YAL066W   YAL066W     1    309
#> 5      5 YAL064W-B YAL064W-B     1    381
#> 6      6   YAL064W   YAL064W     1    285
#> 7      7   YAL062W   YAL062W     1   1374
#> 8      8   YAL061W   YAL061W     1   1254
#> 9      9   YAL060W   YAL060W     1   1149
#> 10    10   YAL059W   YAL059W     1    639

As seen from the above output, the transcript lengths are variable, but in a 3’-end tag-based RNA-sequencing library, we expect reads to only come from a narrow region at the 3’ end. We can use the methods in the txcutr package to restrict the annotation to that region for each transcript.

2.1.2 Transcriptome Truncation

Let’s use the truncateTxome method to restrict transcripts to their last 500 nucleotides (nts).

library(txcutr)

txdb_w500 <- truncateTxome(txdb, maxTxLength=500)
#> 'select()' returned 1:1 mapping between keys and columns
#> Truncating transcripts...
#> Done.
#> Checking for duplicate transcripts...
#> Removed 0 duplicates.
#> Creating exon ranges...
#> Done.
#> Creating tx ranges...
#> Done.
#> Creating gene ranges...
#> Done.

transcriptLengths(txdb_w500)[1:10,]
#>    tx_id   tx_name   gene_id nexon tx_len
#> 1      1   YAL069W   YAL069W     1    315
#> 2      2 YAL068W-A   YAL069W     1    255
#> 3      3 YAL067W-A YAL067W-A     1    228
#> 4      4   YAL066W   YAL066W     1    309
#> 5      5 YAL064W-B YAL064W-B     1    381
#> 6      6   YAL064W   YAL064W     1    285
#> 7      7   YAL062W   YAL062W     1    500
#> 8      8   YAL061W   YAL061W     1    500
#> 9      9   YAL060W   YAL060W     1    500
#> 10    10   YAL059W   YAL059W     1    500

Observe how the all transcripts that were previously longer than 500 nts are now exactly 500 nts. Also, any transcripts that were already short enough remain unchanged.

2.1.3 Exporting a New Annotation

We can now export this new transcriptome as a GTF file that could be used in an alignment pipeline or genome viewer. Note that ending the file name with .gz will automatically signal to that the file should be exported with gzip compression.

gtf_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".gtf.gz")
exportGTF(txdb_w500, file=gtf_file)

2.1.4 Exporting Sequences

If we need to prepare an index for alignment or pseudo-alignment, this requires exporting the corresponding sequences of the truncated transcripts. To do this, will need to load a BSgenome object for sacCer3.

library(BSgenome.Scerevisiae.UCSC.sacCer3)
sacCer3 <- BSgenome.Scerevisiae.UCSC.sacCer3

fasta_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".fa.gz")
exportFASTA(txdb_w500, genome=sacCer3, file=fasta_file)

2.1.5 Merge Table

Another file that might be needed by some quantification tools is a merge table. That is, a file that represents a map from the input transcripts to either output transcripts or genes. In some pipelines, one may wish to merge any transcripts that have any overlap. In others, there might be some minimum amount of distance between 3’ ends below which it may be uninteresting to discern between. For this, txcutr provides a generateMergeTable method, together with a minDistance argument.

To create a merge for any overlaps whatsoever, one would set the minDistance to be identical to the maximum transcript length.

df_merge <- generateMergeTable(txdb_w500, minDistance=500)

head(df_merge, 10)
#>      tx_in  tx_out gene_out
#> 1  YAL001C YAL001C  YAL001C
#> 2  YAL002W YAL002W  YAL002W
#> 3  YAL003W YAL003W  YAL003W
#> 4  YAL004W YAL004W  YAL004W
#> 5  YAL005C YAL005C  YAL005C
#> 6  YAL007C YAL007C  YAL007C
#> 7  YAL008W YAL008W  YAL008W
#> 8  YAL009W YAL009W  YAL009W
#> 9  YAL010C YAL010C  YAL010C
#> 10 YAL011W YAL011W  YAL011W

In this particular case, these first transcripts each map to themselves. This should not be a surprise, since very few genes in the input annotation had alternative transcripts. However, we can filter to highlight any transcripts that would get merged.

df_merge[df_merge$tx_in != df_merge$tx_out,]
#>       tx_in    tx_out gene_out
#> 28  YAL026C YAL026C-A  YAL026C
#> 84  YAL069W YAL068W-A  YAL069W
#> 118 YAR073W   YAR075W  YAR073W

Algorithmically, txcutr will give preference to more distal transcripts in the merge assignment. For example, looking at transcripts from gene YAL026C:

transcripts(txdb_w500, columns=c("tx_name", "gene_id"),
            filter=list(gene_id=c("YAL026C")))
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames      ranges strand |     tx_name         gene_id
#>          <Rle>   <IRanges>  <Rle> | <character> <CharacterList>
#>   [1]     chrI 95386-95823      - |   YAL026C-A         YAL026C
#>   [2]     chrI 95630-96129      - |     YAL026C         YAL026C
#>   -------
#>   seqinfo: 1 sequence from sacCer3 genome

we see that they are on the negative strand and YAL026C-A is more distal in this orientation. This is consistent with YAL026C being merged into YAL026C-A if we are merging all overlapping transcripts.

Note that one can also directly export the merge table using the exportMergeTable function.

2.2 Notes on Usage

2.2.1 Making TxDb Objects

The txdbmaker package provides a number of means to create custom TxDb objects that can then be used by txcutr. Alternative workflows might include starting from a GTF/GFF3 transcriptome annotation and using the txdbmaker::makeTxDbFromGFF method, or using the txdbmaker::makeTxDbFromBiomart method to create the object from Biomart resources.

2.2.2 BiocParallel

The truncation step can be computationally intensive, especially for species whose annotations include many alternative transcript isoforms, such as mouse or human. The truncateTxome method makes use of BiocParallel::bplapply when applicable, and will respect the BiocParallel::register() setting, unless an explicit BPPARAM argument is supplied.

We encourage use of a parallel parameter option, such as MulticoreParam, but it should be noted that with mammalian transcriptomes this results in a maximum RAM usage between 3-4GB per core. Please ensure adequate memory per core is available.

2.2.3 Alternative Cleavage and Polyadenylation

This tool developed out of a pipeline for quantifying 3’ UTR isoform expression from scRNA-seq. In such an application, it can be helpful to pre-filter transcripts that may be included in an annotation, but do not have validated 3’ ends. Of particular note are the GENCODE annotations: we recommend pre-filtering GENCODE GTF/GFF files to remove any entries with the tag mRNA_end_NF.

3 Reproducibility

The txcutr package (Fansler, 2024) was made possible thanks to:

  • R (R Core Team, 2024)
  • BiocStyle (Oleś, 2024)
  • knitr (Xie, 2024)
  • RefManageR (McLean, 2017)
  • rmarkdown (Allaire, Xie, Dervieux, McPherson, Luraschi, Ushey, Atkins, Wickham, Cheng, Chang, and Iannone, 2024)
  • sessioninfo (Wickham, Chang, Flight, Müller, and Hester, 2021)
  • testthat (Wickham, 2011)

This package was developed using biocthis.

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("intro.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("intro.Rmd", tangle = TRUE)

Date the vignette was generated.

#> [1] "2024-10-29 20:00:17 EDT"

Wallclock time spent generating the vignette.

#> Time difference of 30.9 secs

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R Under development (unstable) (2024-10-21 r87258)
#>  os       Ubuntu 24.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-10-29
#>  pandoc   3.1.3 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package                               * version   date (UTC) lib source
#>  abind                                   1.4-8     2024-09-12 [2] CRAN (R 4.5.0)
#>  AnnotationDbi                         * 1.69.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  backports                               1.5.0     2024-05-23 [2] CRAN (R 4.5.0)
#>  bibtex                                  0.5.1     2023-01-26 [2] CRAN (R 4.5.0)
#>  Biobase                               * 2.67.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  BiocFileCache                           2.15.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  BiocGenerics                          * 0.53.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  BiocIO                                  1.17.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  BiocManager                             1.30.25   2024-08-28 [2] CRAN (R 4.5.0)
#>  BiocParallel                            1.41.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  BiocStyle                             * 2.35.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  biomaRt                                 2.63.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  Biostrings                              2.75.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  bit                                     4.5.0     2024-09-20 [2] CRAN (R 4.5.0)
#>  bit64                                   4.5.2     2024-09-22 [2] CRAN (R 4.5.0)
#>  bitops                                  1.0-9     2024-10-03 [2] CRAN (R 4.5.0)
#>  blob                                    1.2.4     2023-03-17 [2] CRAN (R 4.5.0)
#>  bookdown                                0.41      2024-10-16 [2] CRAN (R 4.5.0)
#>  bslib                                   0.8.0     2024-07-29 [2] CRAN (R 4.5.0)
#>  cachem                                  1.1.0     2024-05-16 [2] CRAN (R 4.5.0)
#>  cli                                     3.6.3     2024-06-21 [2] CRAN (R 4.5.0)
#>  codetools                               0.2-20    2024-03-31 [3] CRAN (R 4.5.0)
#>  crayon                                  1.5.3     2024-06-20 [2] CRAN (R 4.5.0)
#>  curl                                    5.2.3     2024-09-20 [2] CRAN (R 4.5.0)
#>  DBI                                     1.2.3     2024-06-02 [2] CRAN (R 4.5.0)
#>  dbplyr                                  2.5.0     2024-03-19 [2] CRAN (R 4.5.0)
#>  DelayedArray                            0.33.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  digest                                  0.6.37    2024-08-19 [2] CRAN (R 4.5.0)
#>  dplyr                                   1.1.4     2023-11-17 [2] CRAN (R 4.5.0)
#>  evaluate                                1.0.1     2024-10-10 [2] CRAN (R 4.5.0)
#>  fansi                                   1.0.6     2023-12-08 [2] CRAN (R 4.5.0)
#>  fastmap                                 1.2.0     2024-05-15 [2] CRAN (R 4.5.0)
#>  filelock                                1.0.3     2023-12-11 [2] CRAN (R 4.5.0)
#>  generics                                0.1.3     2022-07-05 [2] CRAN (R 4.5.0)
#>  GenomeInfoDb                          * 1.43.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  GenomeInfoDbData                        1.2.13    2024-10-23 [2] Bioconductor
#>  GenomicAlignments                       1.43.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  GenomicFeatures                       * 1.59.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  GenomicRanges                         * 1.59.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  glue                                    1.8.0     2024-09-30 [2] CRAN (R 4.5.0)
#>  hms                                     1.1.3     2023-03-21 [2] CRAN (R 4.5.0)
#>  htmltools                               0.5.8.1   2024-04-04 [2] CRAN (R 4.5.0)
#>  httr                                    1.4.7     2023-08-15 [2] CRAN (R 4.5.0)
#>  httr2                                   1.0.5     2024-09-26 [2] CRAN (R 4.5.0)
#>  IRanges                               * 2.41.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  jquerylib                               0.1.4     2021-04-26 [2] CRAN (R 4.5.0)
#>  jsonlite                                1.8.9     2024-09-20 [2] CRAN (R 4.5.0)
#>  KEGGREST                                1.47.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  knitr                                   1.48      2024-07-07 [2] CRAN (R 4.5.0)
#>  lattice                                 0.22-6    2024-03-20 [3] CRAN (R 4.5.0)
#>  lifecycle                               1.0.4     2023-11-07 [2] CRAN (R 4.5.0)
#>  lubridate                               1.9.3     2023-09-27 [2] CRAN (R 4.5.0)
#>  magrittr                                2.0.3     2022-03-30 [2] CRAN (R 4.5.0)
#>  Matrix                                  1.7-1     2024-10-18 [3] CRAN (R 4.5.0)
#>  MatrixGenerics                          1.19.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  matrixStats                             1.4.1     2024-09-08 [2] CRAN (R 4.5.0)
#>  memoise                                 2.0.1     2021-11-26 [2] CRAN (R 4.5.0)
#>  pillar                                  1.9.0     2023-03-22 [2] CRAN (R 4.5.0)
#>  pkgconfig                               2.0.3     2019-09-22 [2] CRAN (R 4.5.0)
#>  plyr                                    1.8.9     2023-10-02 [2] CRAN (R 4.5.0)
#>  png                                     0.1-8     2022-11-29 [2] CRAN (R 4.5.0)
#>  prettyunits                             1.2.0     2023-09-24 [2] CRAN (R 4.5.0)
#>  progress                                1.2.3     2023-12-06 [2] CRAN (R 4.5.0)
#>  R6                                      2.5.1     2021-08-19 [2] CRAN (R 4.5.0)
#>  rappdirs                                0.3.3     2021-01-31 [2] CRAN (R 4.5.0)
#>  Rcpp                                    1.0.13    2024-07-17 [2] CRAN (R 4.5.0)
#>  RCurl                                   1.98-1.16 2024-07-11 [2] CRAN (R 4.5.0)
#>  RefManageR                            * 1.4.0     2022-09-30 [2] CRAN (R 4.5.0)
#>  restfulr                                0.0.15    2022-06-16 [2] CRAN (R 4.5.0)
#>  rjson                                   0.2.23    2024-09-16 [2] CRAN (R 4.5.0)
#>  rlang                                   1.1.4     2024-06-04 [2] CRAN (R 4.5.0)
#>  rmarkdown                               2.28      2024-08-17 [2] CRAN (R 4.5.0)
#>  Rsamtools                               2.23.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  RSQLite                                 2.3.7     2024-05-27 [2] CRAN (R 4.5.0)
#>  rtracklayer                             1.67.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  S4Arrays                                1.7.0     2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  S4Vectors                             * 0.45.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  sass                                    0.4.9     2024-03-15 [2] CRAN (R 4.5.0)
#>  sessioninfo                           * 1.2.2     2021-12-06 [2] CRAN (R 4.5.0)
#>  SparseArray                             1.7.0     2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  stringi                                 1.8.4     2024-05-06 [2] CRAN (R 4.5.0)
#>  stringr                                 1.5.1     2023-11-14 [2] CRAN (R 4.5.0)
#>  SummarizedExperiment                    1.37.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  tibble                                  3.2.1     2023-03-20 [2] CRAN (R 4.5.0)
#>  tidyselect                              1.2.1     2024-03-11 [2] CRAN (R 4.5.0)
#>  timechange                              0.3.0     2024-01-18 [2] CRAN (R 4.5.0)
#>  txcutr                                * 1.13.0    2024-10-29 [1] Bioconductor 3.21 (R 4.5.0)
#>  TxDb.Scerevisiae.UCSC.sacCer3.sgdGene * 3.2.2     2024-10-24 [2] Bioconductor
#>  txdbmaker                               1.3.0     2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  UCSC.utils                              1.3.0     2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  utf8                                    1.2.4     2023-10-22 [2] CRAN (R 4.5.0)
#>  vctrs                                   0.6.5     2023-12-01 [2] CRAN (R 4.5.0)
#>  xfun                                    0.48      2024-10-03 [2] CRAN (R 4.5.0)
#>  XML                                     3.99-0.17 2024-06-25 [2] CRAN (R 4.5.0)
#>  xml2                                    1.3.6     2023-12-04 [2] CRAN (R 4.5.0)
#>  XVector                                 0.47.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>  yaml                                    2.3.10    2024-07-26 [2] CRAN (R 4.5.0)
#>  zlibbioc                                1.53.0    2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> 
#>  [1] /tmp/RtmpfdjVRQ/Rinst1fd6a81b87d6a8
#>  [2] /home/biocbuild/bbs-3.21-bioc/R/site-library
#>  [3] /home/biocbuild/bbs-3.21-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

4 Bibliography

This vignette was generated using BiocStyle (Oleś, 2024) with knitr (Xie, 2024) and rmarkdown (Allaire, Xie, Dervieux et al., 2024) running behind the scenes.

Citations made with RefManageR (McLean, 2017).

[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.28. 2024. URL: https://github.com/rstudio/rmarkdown.

[2] M. Fansler. txcutr: Transcriptome CUTteR. R package version 1.13.0. 2024. DOI: 10.18129/B9.bioc.txcutr. URL: https://bioconductor.org/packages/txcutr.

[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.

[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.35.0. 2024. DOI: 10.18129/B9.bioc.BiocStyle. URL: https://bioconductor.org/packages/BiocStyle.

[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. URL: https://www.R-project.org/.

[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.2. 2021. DOI: 10.32614/CRAN.package.sessioninfo. URL: https://CRAN.R-project.org/package=sessioninfo.

[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. URL: https://yihui.org/knitr/.