R - Ways to add multiple columns to a data frame using plyr/dplyr/purrr
I need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I know of to do this.
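Setup

    library(plyr)    # load plyr before dplyr so dplyr's verbs (e.g. mutate) are not masked
    library(dplyr)
    library(purrr)
    library(doMC)
    registerDoMC(2)  # register two cores for the .parallel option used below

    df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.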
Method 1: Add columns separately using rowwise and mutate

    foo <- function(x, y) return(x + y)
    bar <- function(x, y) return((x + y) * 100)

    df_out1 <- df %>%
      rowwise() %>%
      mutate(foocol = foo(x, y), barcol = bar(x, y))

This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
Method 2: Trick ddply into a rowwise operation

    df2 <- df
    df2$id <- 1:nrow(df2)  # unique id so that each group is a single row

    df_out2 <- ddply(df2, .(id), function(r) {
      foocol <- r$x + r$y
      barcol <- foocol * 100
      return(cbind(r, foocol, barcol))
    }, .parallel = TRUE)

Here I trick ddply into calling a function on each row by splitting on a unique id column that I just created. It's clunky, though, and requires maintaining a useless column.
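For comparison, the same row-by-row idea can be sketched with the base `parallel` package, which avoids the extra id column. This is only a sketch under assumptions: `foobar_row` is a hypothetical helper standing in for the expensive calculation, and `mclapply` relies on forking, so it runs on a single core on Windows.

```r
library(parallel)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical helper: computes both new values for one row
# (r is a one-row data frame).
foobar_row <- function(r) {
  foocol <- r$x + r$y
  data.frame(foocol = foocol, barcol = foocol * 100)
}

# Forking is not available on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split into a list of one-row data frames, in row order.
rows <- split(df, seq_len(nrow(df)))

# Apply the helper to each row, possibly in parallel.
new_cols <- mclapply(rows, foobar_row, mc.cores = n_cores)

# Bind the per-row results back onto the original columns.
df_out2b <- cbind(df, do.call(rbind, new_cols))
```

Whether this is actually faster depends on how expensive the per-row function is; for cheap calculations the cost of splitting and forking is likely to dominate.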
Method 3: splat

    foobar <- function(x, y, ...) {
      foocol <- x + y
      barcol <- foocol * 100
      return(data.frame(x, y, ..., foocol, barcol))
    }

    df_out3 <- splat(foobar)(df)

I like this solution since you can reference the columns of df in the custom function (which can be anonymous if desired) without array comprehension. However, this method isn't parallelized.
Method 4: by_row

    df_out4 <- df %>%
      by_row(function(r) {
        foocol <- r$x + r$y
        barcol <- foocol * 100
        return(data.frame(foocol = foocol, barcol = barcol))
      }, .collate = "cols")

The by_row function from purrr (moved to the purrrlyr package in later releases) eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df

    df_out5 <- pmap_df(df, foobar)
    # or equivalently...
    df_out5 <- df %>% pmap_df(foobar)

This is the best option that I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe that pmap_df converts df to a list and back, though, so maybe there is a performance hit.
It's also a bit annoying that I need to reference all of the columns that I plan on using for the calculation in the function definition, function(x, y, ...), instead of just function(r) for the row object.
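One possible workaround, sketched here rather than offered as a definitive answer, is to let the anonymous function take only `...` and collect the arguments into a named list, which then behaves much like a row object. The name `r` and the output name `df_out5b` are illustrative choices.

```r
library(purrr)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

df_out5b <- pmap_df(df, function(...) {
  # Collect the current row's values into a named list:
  # r$x, r$y and r$z are each length-one vectors.
  r <- list(...)
  foocol <- r$x + r$y
  barcol <- foocol * 100
  # Return the original values plus the two new columns as a one-row data frame.
  data.frame(r, foocol = foocol, barcol = barcol)
})
```

This keeps every column of df available inside the function without naming them in the signature, at the cost of building a list for each row.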
Am I missing anything, or are there better options? Are there any concerns with any of the methods described?
How about using data.table?

    library(data.table)

    foo <- function(x, y) return(x + y)
    bar <- function(x, y) return((x + y) * 100)

    dt <- as.data.table(df)
    dt[, foocol := foo(x, y)]  # add foocol by reference
    dt[, barcol := bar(x, y)]  # add barcol by reference

The data.table library is quite fast and has at least some potential for parallelization.
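Since the question asks for both columns from a single function call, here is a minimal sketch of that variant. It assumes the custom function can work on whole columns and return a list with one element per new column; `foobar_dt` is a hypothetical name.

```r
library(data.table)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical helper: does the shared calculation once and
# returns both results as a list of equal-length vectors.
foobar_dt <- function(x, y) {
  foocol <- x + y
  list(foocol = foocol, barcol = foocol * 100)
}

dt <- as.data.table(df)

# Assign both new columns by reference from a single call.
dt[, c("foocol", "barcol") := foobar_dt(x, y)]
```

If the function only works on one row at a time, the same assignment can be grouped per row with by = seq_len(nrow(dt)), though that gives up most of the speed advantage.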