DFplyr 1.1.0
DFplyr
DFplyr is an R package available via the Bioconductor repository and can be installed with BiocManager::install():
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("DFplyr")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
DFplyr is inspired by dplyr, which implements a wide variety of common data manipulations (mutate, select, filter) but only operates on objects of class data.frame or tibble (from the tibble package).
When working with S4Vectors DataFrames, which are frequently used as components of, for example, SummarizedExperiment objects, a common workaround is to convert the DataFrame to a tibble, use dplyr functions to manipulate the contents, and then convert back to a DataFrame.
This has several drawbacks: tibble does not support rownames (and dplyr frequently does not preserve them), it does not support S4 columns (e.g. IRanges vectors), and the back-and-forth transformation is required any time manipulation is desired.
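As a minimal sketch of the workaround described above, the following round trip uses a small toy DataFrame (the data and object names here are illustrative assumptions, not part of DFplyr) and shows that the row names do not survive the conversion through a tibble.

```r
library(S4Vectors)

# A small toy DataFrame with row names (illustrative data only)
d_toy <- DataFrame(
    cyl = c(6, 4, 8),
    hp = c(110, 93, 175),
    row.names = c("Mazda RX4", "Datsun 710", "Hornet Sportabout")
)

# Workaround: DataFrame -> tibble -> dplyr verb -> DataFrame
tb <- tibble::as_tibble(as.data.frame(d_toy)) # row names are dropped here
tb <- dplyr::mutate(tb, newvar = cyl + hp)
back <- as(as.data.frame(tb), "DataFrame")

rownames(d_toy) # the original labels
rownames(back)  # no longer the original labels
```

The new column is computed correctly, but the labels have to be carried separately (for example as an explicit column) if they are needed after the round trip.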
DFplyr
library("DFplyr")
To begin with, we create an S4Vectors DataFrame, including some S4 columns:
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,...
This will appear in RStudio’s environment pane as a Formal class DataFrame (dplyr-compatible) when using DFplyr. No interference with the actual object is required, but this helps identify that dplyr-compatibility is available.
DataFrames can then be used in dplyr-like calls in the same way as data.frame or tibble objects. Support for working with S4 columns is enabled provided they have appropriate functions. Adding multiple columns will result in the new columns being created in alphabetical order. For example, adding a new column newvar which is the sum of the cyl and hp columns:
mutate(d, newvar = cyl + hp)
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl newvar
#> <GRanges> <CompressedNumericList> <numeric>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,... 116
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,... 116
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,... 97
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67 116
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33 183
#> ... ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,... 117
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,... 272
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,... 181
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,... 343
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,... 113
or doubling the nl column as nl2:
mutate(d, nl2 = nl * 2)
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl nl2
#> <GRanges> <CompressedNumericList> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,... -0.50, 3.66, 0.06,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,... 0.14,-5.70,-5.86,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,... -1.06,-2.78,-1.04,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67 -1.42, 1.70, 1.34
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33 0.04, 2.32,-0.66
#> ... ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,... 0.08, 0.96,-0.46,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,... 1.24,-0.66,-0.24,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,... 1.58,-3.28,-0.14,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,... -0.18,-3.98,-1.68,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,... -0.98, 2.50, 1.88,...
or calculating the length of each cell of the nl column (via lengths()) as length_nl:
mutate(d, length_nl = lengths(nl))
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl length_nl
#> <GRanges> <CompressedNumericList> <integer>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,... 4
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,... 4
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,... 4
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67 3
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33 3
#> ... ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,... 5
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,... 5
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,... 5
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,... 5
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,... 4
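The example above calls lengths() rather than length(), and the distinction matters for list-like columns. A small sketch with a toy NumericList (the object nl_toy is an illustrative assumption) shows the difference: length() counts the list elements, while lengths() returns one value per element, which is what a per-row column needs.

```r
library(IRanges)

# A toy NumericList with two elements of different sizes
nl_toy <- NumericList(a = c(1, 2, 3), b = 4)

length(nl_toy)  # a single number: how many list elements there are
lengths(nl_toy) # one integer per element: the size of each
```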
Transformations can involve S4-related functions, such as extracting the seqnames(), strand(), and end() of the grX column:
mutate(d,
    chr = GenomeInfoDb::seqnames(grX),
    strand_X = BiocGenerics::strand(grX),
    end_X = BiocGenerics::end(grX)
)
#> DataFrame with 32 rows and 11 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl chr end_X strand_X
#> <GRanges> <CompressedNumericList> <Rle> <integer> <Rle>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,... chrX 10 *
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,... chrX 11 *
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,... chrX 12 *
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67 chrX 13 *
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33 chrX 14 *
#> ... ... ... ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,... chrX 37 *
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,... chrX 38 *
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,... chrX 39 *
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,... chrX 40 *
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,... chrX 41 *
The object returned remains a standard DataFrame, and further calls can be piped with %>%, in this case extracting the newly created newvar column:
mutate(d, newvar = cyl + hp) %>%
    pull(newvar)
#> [1] 116 116 97 116 183 111 253 66 99 129 129 188 188 188 213 223 238 70 56
#> [20] 69 101 158 158 253 183 70 95 117 272 181 343 113
Some of the variants of the dplyr verbs also work, such as transforming the numeric columns using a formula-style lambda function, in this case squaring them:
mutate_if(d, is.numeric, ~ .^2)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 36 12100 1 16 25600 chrX:1-10
#> Mazda RX4 Wag 36 12100 1 16 25600 chrX:2-11
#> Datsun 710 16 8649 1 16 11664 chrX:3-12
#> Hornet 4 Drive 36 12100 0 9 66564 chrX:4-13
#> Hornet Sportabout 64 30625 0 9 129600 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 16 12769 1 25 9044.01 chrX:28-37
#> Ford Pantera L 64 69696 1 25 123201.00 chrX:29-38
#> Ferrari Dino 36 30625 1 25 21025.00 chrX:30-39
#> Maserati Bora 64 112225 1 25 90601.00 chrX:31-40
#> Volvo 142E 16 11881 1 16 14641.00 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,...
or extracting the start of all of the "GRanges" columns:
mutate_if(d, ~ isa(., "GRanges"), BiocGenerics::start)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <integer>
#> Mazda RX4 6 110 1 4 160 1
#> Mazda RX4 Wag 6 110 1 4 160 2
#> Datsun 710 4 93 1 4 108 3
#> Hornet 4 Drive 6 110 0 3 258 4
#> Hornet Sportabout 8 175 0 3 360 5
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 28
#> Ford Pantera L 8 264 1 5 351.0 29
#> Ferrari Dino 6 175 1 5 145.0 30
#> Maserati Bora 8 335 1 5 301.0 31
#> Volvo 142E 4 109 1 4 121.0 32
#> grY nl
#> <integer> <CompressedNumericList>
#> Mazda RX4 1 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag 2 0.07,-2.85,-2.93,...
#> Datsun 710 3 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive 4 -0.71, 0.85, 0.67
#> Hornet Sportabout 5 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa 28 0.04, 0.48,-0.23,...
#> Ford Pantera L 29 0.62,-0.33,-0.12,...
#> Ferrari Dino 30 0.79,-1.64,-0.07,...
#> Maserati Bora 31 -0.09,-1.99,-0.84,...
#> Volvo 142E 32 -0.49, 1.25, 0.94,...
Use of tidyselect helpers is limited to within vars() calls and using the _at variants:
mutate_at(d, vars(starts_with("c")), ~ .^2)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 36 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 36 110 1 4 160 chrX:2-11
#> Datsun 710 16 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 36 110 0 3 258 chrX:4-13
#> Hornet Sportabout 64 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 16 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 64 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 36 175 1 5 145.0 chrX:30-39
#> Maserati Bora 64 335 1 5 301.0 chrX:31-40
#> Volvo 142E 16 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,...
This also works with other verbs:
select_at(d, vars(starts_with("gr")))
#> DataFrame with 32 rows and 2 columns
#> grX grY
#> <GRanges> <GRanges>
#> Mazda RX4 chrX:1-10 chrY:1-10
#> Mazda RX4 Wag chrX:2-11 chrY:2-11
#> Datsun 710 chrX:3-12 chrY:3-12
#> Hornet 4 Drive chrX:4-13 chrY:4-13
#> Hornet Sportabout chrX:5-14 chrY:5-14
#> ... ... ...
#> Lotus Europa chrX:28-37 chrY:28-37
#> Ford Pantera L chrX:29-38 chrY:29-38
#> Ferrari Dino chrX:30-39 chrY:30-39
#> Maserati Bora chrX:31-40 chrY:31-40
#> Volvo 142E chrX:32-41 chrY:32-41
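For comparison, the same kind of prefix-based column selection can be sketched with plain DataFrame subsetting, without any tidyselect helper. The toy object below is an illustrative assumption (it uses IRanges columns rather than GRanges to keep the example small).

```r
library(S4Vectors)

# A toy DataFrame with one numeric column and two S4 (IRanges) columns
d_sel <- DataFrame(
    cyl = c(6, 4),
    grX = IRanges::IRanges(1:2, width = 10),
    grY = IRanges::IRanges(3:4, width = 10)
)

# Keep only the columns whose names start with "gr"
sel <- d_sel[, startsWith(colnames(d_sel), "gr"), drop = FALSE]
colnames(sel)
```

The vars(starts_with("gr")) form shown above expresses the same intent in dplyr syntax and fits into a pipeline.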
Importantly, grouped operations are supported. DataFrame does not natively support groups (the same way that data.frame does not), so these are implemented specifically for DFplyr, with group information shown at the top of the printed output:
group_by(d, cyl, am)
#> DataFrame with 32 rows and 8 columns
#> Groups: cyl, am
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,...
Other verbs are similarly implemented, and preserve row names where possible. For example, selecting a limited set of columns using non-standard evaluation (NSE)
select(d, am, cyl)
#> DataFrame with 32 rows and 2 columns
#> am cyl
#> <numeric> <numeric>
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> ... ... ...
#> Lotus Europa 1 4
#> Ford Pantera L 1 8
#> Ferrari Dino 1 6
#> Maserati Bora 1 8
#> Volvo 142E 1 4
Arranging rows according to the ordering of a column
arrange(d, desc(hp))
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Maserati Bora 8 335 1 5 301 chrX:31-40
#> Ford Pantera L 8 264 1 5 351 chrX:29-38
#> Duster 360 8 245 0 3 360 chrX:7-16
#> Camaro Z28 8 245 0 3 350 chrX:24-33
#> Chrysler Imperial 8 230 0 3 440 chrX:17-26
#> ... ... ... ... ... ... ...
#> Fiat 128 4 66 1 4 78.7 chrX:18-27
#> Fiat X1-9 4 66 1 4 79.0 chrX:26-35
#> Toyota Corolla 4 65 1 4 71.1 chrX:20-29
#> Merc 240D 4 62 0 4 146.7 chrX:8-17
#> Honda Civic 4 52 1 4 75.7 chrX:19-28
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Duster 360 chrY:7-16 0.54,0.21,0.69
#> Camaro Z28 chrY:24-33 -0.75, 0.63,-1.34
#> Chrysler Imperial chrY:17-26 -1.22,-0.28, 0.75
#> ... ... ...
#> Fiat 128 chrY:18-27 0.14,-0.37, 1.42,...
#> Fiat X1-9 chrY:26-35 2.05,0.42,0.18,...
#> Toyota Corolla chrY:20-29 0.13, 0.99,-0.26,...
#> Merc 240D chrY:8-17 -0.35,-1.39, 1.89,...
#> Honda Civic chrY:19-28 -1.11, 1.94,-0.65,...
Filtering to only specific values appearing in a column
filter(d, am == 0)
#> DataFrame with 19 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Hornet 4 Drive 6 110 0 3 258.0 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360.0 chrX:5-14
#> Valiant 6 105 0 3 225.0 chrX:6-15
#> Duster 360 8 245 0 3 360.0 chrX:7-16
#> Merc 240D 4 62 0 4 146.7 chrX:8-17
#> ... ... ... ... ... ... ...
#> Toyota Corona 4 97 0 3 120.1 chrX:21-30
#> Dodge Challenger 8 150 0 3 318.0 chrX:22-31
#> AMC Javelin 8 150 0 3 304.0 chrX:23-32
#> Camaro Z28 8 245 0 3 350.0 chrX:24-33
#> Pontiac Firebird 8 175 0 3 400.0 chrX:25-34
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> Valiant chrY:6-15 0.00, 0.26,-0.27
#> Duster 360 chrY:7-16 0.54,0.21,0.69
#> Merc 240D chrY:8-17 -0.35,-1.39, 1.89,...
#> ... ... ...
#> Toyota Corona chrY:21-30 0.73,-1.65,-0.09
#> Dodge Challenger chrY:22-31 -0.35,-1.50,-0.58
#> AMC Javelin chrY:23-32 1.30,0.79,0.63
#> Camaro Z28 chrY:24-33 -0.75, 0.63,-1.34
#> Pontiac Firebird chrY:25-34 -0.38,-1.69, 1.81
Selecting specific rows by index
slice(d, 3:6)
#> DataFrame with 4 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> Valiant 6 105 0 3 225 chrX:6-15
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> Valiant chrY:6-15 0.00, 0.26,-0.27
These also work for grouped objects, and also preserve the rownames, e.g.
selecting the first two rows from each group of gear
group_by(d, gear) %>%
    slice(1:2)
#> DataFrame with 6 rows and 8 columns
#> Groups: gear
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Hornet Sportabout 8 175 0 3 360.0 chrX:5-14
#> Merc 450SL 8 180 0 3 275.8 chrX:13-22
#> Mazda RX4 6 110 1 4 160.0 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160.0 chrX:2-11
#> Porsche 914-2 4 91 1 5 120.3 chrX:27-36
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> Merc 450SL chrY:13-22 0.09,0.83,0.71
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Porsche 914-2 chrY:27-36 0.95,-0.31,-2.11,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
rename is itself renamed to rename2 due to conflicts between dplyr and S4Vectors, but works in the dplyr sense of taking new = old replacements with NSE syntax:
select(d, am, cyl) %>%
    rename2(foo = am)
#> DataFrame with 32 rows and 2 columns
#> foo cyl
#> <numeric> <numeric>
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> ... ... ...
#> Lotus Europa 1 4
#> Ford Pantera L 1 8
#> Ferrari Dino 1 6
#> Maserati Bora 1 8
#> Volvo 142E 1 4
Row names are not preserved when there may be duplicates or when they don’t make sense; otherwise the first label is kept, according to the current de-duplication method (in the case of distinct, this is via BiocGenerics::duplicated). This may have complications for S4 columns.
distinct(d)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 -0.25, 1.83, 0.03,...
#> Mazda RX4 Wag chrY:2-11 0.07,-2.85,-2.93,...
#> Datsun 710 chrY:3-12 -0.53,-1.39,-0.52,...
#> Hornet 4 Drive chrY:4-13 -0.71, 0.85, 0.67
#> Hornet Sportabout chrY:5-14 0.02, 1.16,-0.33
#> ... ... ...
#> Lotus Europa chrY:28-37 0.04, 0.48,-0.23,...
#> Ford Pantera L chrY:29-38 0.62,-0.33,-0.12,...
#> Ferrari Dino chrY:30-39 0.79,-1.64,-0.07,...
#> Maserati Bora chrY:31-40 -0.09,-1.99,-0.84,...
#> Volvo 142E chrY:32-41 -0.49, 1.25, 0.94,...
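In the output above every row is already unique, so nothing is dropped. To illustrate what "the first label" means when duplicates do exist, here is a base R analogy on a plain data.frame; it is a sketch of the general duplicated() convention and is not taken from the DFplyr implementation.

```r
# Toy data: rows "first" and "third" have identical contents
df_dup <- data.frame(
    x = c(1, 2, 1),
    y = c("a", "b", "a"),
    row.names = c("first", "second", "third")
)

# duplicated() flags the later occurrence, so the earlier label survives
keep <- !duplicated(df_dup)
dedup <- df_dup[keep, ]
rownames(dedup)
```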
Behaviours are ideally the same as those of dplyr wherever possible, for example a grouped tally
group_by(d, cyl, am) %>%
    tally(gear)
#> DataFrame with 6 rows and 3 columns
#> cyl am n
#> <numeric> <numeric> <numeric>
#> 1 4 0 11
#> 2 4 1 34
#> 3 6 0 14
#> 4 6 1 13
#> 5 8 0 36
#> 6 8 1 10
or a count of the rows in each combination of columns
count(d, gear, am, cyl)
#> DataFrame with 10 rows and 4 columns
#> gear am cyl n
#> <factor> <Rle> <Rle> <integer>
#> 1 3 0 4 1
#> 2 3 0 6 2
#> 3 3 0 8 12
#> 4 4 0 4 2
#> 5 4 0 6 2
#> 6 4 1 4 6
#> 7 4 1 6 2
#> 8 5 1 4 2
#> 9 5 1 6 1
#> 10 5 1 8 2
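The tally and count results above can be cross-checked against base R, since the relevant columns come straight from mtcars. This is only a verification sketch (the object names are illustrative), not how DFplyr computes its results.

```r
# Grouped tally: sum of gear within each cyl/am combination
tally_check <- tapply(
    mtcars$gear,
    list(cyl = mtcars$cyl, am = mtcars$am),
    sum
)
tally_check

# Count: number of rows for each gear/am/cyl combination
count_check <- table(gear = mtcars$gear, am = mtcars$am, cyl = mtcars$cyl)

# Only combinations that actually occur appear in the count() output
sum(count_check > 0)
```

The entries of tally_check correspond to the n column of the grouped tally, and the non-zero cells of count_check correspond to the rows returned by count().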
DFplyr
We hope that DFplyr will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
citation("DFplyr")
#> To cite package 'DFplyr' in publications use:
#>
#> Carroll J (2024). _DFplyr: A `DataFrame` (`S4Vectors`) backend for
#> `dplyr`_. doi:10.18129/B9.bioc.DFplyr
#> <https://doi.org/10.18129/B9.bioc.DFplyr>, R package version 1.1.0,
#> <https://bioconductor.org/packages/DFplyr>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {DFplyr: A `DataFrame` (`S4Vectors`) backend for `dplyr`},
#> author = {Jonathan Carroll},
#> year = {2024},
#> note = {R package version 1.1.0},
#> url = {https://bioconductor.org/packages/DFplyr},
#> doi = {10.18129/B9.bioc.DFplyr},
#> }
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R Under development (unstable) (2024-10-21 r87258)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2024-10-29
#> pandoc 3.1.3 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> BiocGenerics * 0.53.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> BiocManager 1.30.25 2024-08-28 [2] CRAN (R 4.5.0)
#> BiocStyle * 2.35.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> bookdown 0.41 2024-10-16 [2] CRAN (R 4.5.0)
#> bslib 0.8.0 2024-07-29 [2] CRAN (R 4.5.0)
#> cachem 1.1.0 2024-05-16 [2] CRAN (R 4.5.0)
#> cli 3.6.3 2024-06-21 [2] CRAN (R 4.5.0)
#> DFplyr * 1.1.0 2024-10-29 [1] Bioconductor 3.21 (R 4.5.0)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.5.0)
#> dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.5.0)
#> evaluate 1.0.1 2024-10-10 [2] CRAN (R 4.5.0)
#> fansi 1.0.6 2023-12-08 [2] CRAN (R 4.5.0)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.5.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.5.0)
#> GenomeInfoDb 1.43.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> GenomeInfoDbData 1.2.13 2024-10-23 [2] Bioconductor
#> GenomicRanges 1.59.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.5.0)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
#> httr 1.4.7 2023-08-15 [2] CRAN (R 4.5.0)
#> IRanges 2.41.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.5.0)
#> jsonlite 1.8.9 2024-09-20 [2] CRAN (R 4.5.0)
#> knitr 1.48 2024-07-07 [2] CRAN (R 4.5.0)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.5.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.5.0)
#> pillar 1.9.0 2023-03-22 [2] CRAN (R 4.5.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.5.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.5.0)
#> rlang 1.1.4 2024-06-04 [2] CRAN (R 4.5.0)
#> rmarkdown 2.28 2024-08-17 [2] CRAN (R 4.5.0)
#> S4Vectors * 0.45.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> sass 0.4.9 2024-03-15 [2] CRAN (R 4.5.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.5.0)
#> tibble 3.2.1 2023-03-20 [2] CRAN (R 4.5.0)
#> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.5.0)
#> UCSC.utils 1.3.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> utf8 1.2.4 2023-10-22 [2] CRAN (R 4.5.0)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.5.0)
#> withr 3.0.2 2024-10-28 [2] CRAN (R 4.5.0)
#> xfun 0.48 2024-10-03 [2] CRAN (R 4.5.0)
#> XVector 0.47.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.5.0)
#> zlibbioc 1.53.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
#>
#> [1] /tmp/RtmppTL1u8/Rinstaac28794b921c
#> [2] /home/biocbuild/bbs-3.21-bioc/R/site-library
#> [3] /home/biocbuild/bbs-3.21-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────