R - Ways to add multiple columns to a data frame using plyr/dplyr/purrr
I need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I know how to do this.
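Setup
# Load plyr before dplyr so that dplyr's verbs (e.g. mutate) are not masked by plyr's
library(plyr)
library(dplyr)
library(purrr)
library(doMC)
registerDoMC(2)  # parallel backend with 2 workers, used by ddply(.parallel = TRUE)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))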
Suppose I want two new columns, foocol = x + y and barcol = (x + y) * 100, but these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>%
  rowwise() %>%
  mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
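One way to avoid the duplicated calculation, sketched here under the assumption that dplyr 1.0.0 or later is installed and that the custom function is vectorized: mutate() unpacks an unnamed data-frame result into several columns, so a single call can return both. The helper name foobar_cols is hypothetical and not part of the original post. This sketch is not parallelized either.

```r
library(dplyr)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical helper: does the "expensive" x + y step once and
# returns both new columns as a data frame
foobar_cols <- function(x, y) {
  foocol <- x + y
  data.frame(foocol = foocol, barcol = foocol * 100)
}

# Assumption: dplyr >= 1.0.0, where an unnamed data-frame result
# inside mutate() is spliced in as multiple columns
df_out <- df %>% mutate(foobar_cols(x, y))
```

Because the function is called once on whole columns, no rowwise() step is needed here.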
Method 2: Trick ddply into a rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)  # unique id per row, so each group is a single row
df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = TRUE)
Here I trick ddply into calling a function on each row by splitting on a unique id column that I just created. It's clunky, though, and requires maintaining a useless column.
Method 3: splat
foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since I can reference the columns of df in the custom function (which can be anonymous if desired) without array comprehension. However, this method isn't parallelized.
Method 4: by_row
# Note: in newer purrr releases, by_row() lives in the purrrlyr package
df_out4 <- df %>%
  by_row(function(r) {
    foocol <- r$x + r$y
    barcol <- foocol * 100
    return(data.frame(foocol = foocol, barcol = barcol))
  }, .collate = "cols")
The by_row function from purrr eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so maybe there is a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for the calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
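For comparison, a function(r)-style row function can be run in parallel with base R alone. This is only a sketch of one possible approach, not something from the methods above: it assumes the parallel package's mclapply, which forks on Unix-alikes and supports only one core on Windows, and the name foobar_row is hypothetical.

```r
library(parallel)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical row-level function: receives a one-row data frame `r`
foobar_row <- function(r) {
  foocol <- r$x + r$y   # the "expensive" step, done once per row
  barcol <- foocol * 100
  data.frame(foocol = foocol, barcol = barcol)
}

# mclapply relies on forking; on Windows only mc.cores = 1 is supported
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split into a list of one-row data frames, in row order
rows <- split(df, seq_len(nrow(df)))

# Apply the row function in parallel, then stack the results
new_cols <- do.call(rbind, mclapply(rows, foobar_row, mc.cores = n_cores))
df_out <- cbind(df, new_cols)
```

Splitting a data frame into one-row pieces has its own overhead, so this is likely to pay off only when the per-row calculation really is expensive.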
Am I missing anything, or are there better options? Are there any concerns with the methods described?
How about using data.table?
library(data.table)
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
dt <- as.data.table(df)
dt[, foocol := foo(x, y)]  # := adds the column by reference
dt[, barcol := bar(x, y)]
The data.table library is quite fast and has at least some potential for parallelization.
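If the goal is to add both columns with a single function call, data.table's := also accepts a character vector of names on the left and a list on the right. A minimal sketch, assuming the custom function is vectorized and returns a list with one element per new column (the list-returning foobar below is a hypothetical variant of the question's function):

```r
library(data.table)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical variant of the custom function: computes x + y once
# and returns both new columns as a list
foobar <- function(x, y) {
  foocol <- x + y
  list(foocol = foocol, barcol = foocol * 100)
}

dt <- as.data.table(df)

# Assign both columns by reference from a single call
dt[, c("foocol", "barcol") := foobar(x, y)]
```

This calls foobar once for the whole table rather than once per row, so it only fits calculations that work on whole columns.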