r - Subset by group with data.table
Assume I have a data table containing some baseball players:

    library(plyr)
    library(data.table)

    bdt <- as.data.table(baseball)

For each player (given by id), I want to find the row corresponding to the year in which they played the most games. This is straightforward in plyr:

    ddply(baseball, "id", subset, g == max(g))

What's the equivalent code for data.table?
I tried:

    setkey(bdt, "id")
    bdt[g == max(g)]            # only 1 row
    bdt[g == max(g), by = id]   # Error: 'by' or 'keyby' is supplied but not j
    bdt[, .SD[g == max(g)]]     # only 1 row

This works:

    bdt[, .SD[g == max(g)], by = id]

But it's only 30% faster than plyr, suggesting it's probably not idiomatic.
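Here's the fast data.table way:

    bdt[bdt[, .I[g == max(g)], by = id]$V1]

The inner call returns, for each id, the row numbers (.I) of the rows where g equals that player's maximum; the outer call then subsets bdt by those row numbers. This avoids constructing .SD, which is the bottleneck in your expressions.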
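To make the two steps visible, here is a minimal sketch on a small made-up table. The name `toy` and its values are hypothetical stand-ins for the baseball data, chosen so that one group has a tie.

```r
library(data.table)

# Hypothetical toy data standing in for the baseball table
toy <- data.table(
  id   = c("a", "a", "b", "b", "b"),
  year = c(2001, 2002, 2001, 2002, 2003),
  g    = c(10, 50, 30, 30, 5)
)

# Step 1: per id, collect the row numbers (.I, relative to the whole table)
# of the rows where g equals that group's maximum.
# Because j returns an unnamed vector, the result column is called V1.
idx <- toy[, .I[g == max(g)], by = id]

# Step 2: use those row numbers as an ordinary i subset of the full table.
result <- toy[idx$V1]
print(result)
```

Note that ties are kept: group "b" reaches its maximum in two different years, so both rows come back, just as with the `.SD[g == max(g)]` version.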
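Edit: Actually, the main reason the OP's version is slow is not just that it has .SD in it, but that it uses it in a particular way: by calling [.data.table, which at the moment has a huge overhead, so running it in a loop (once per group, when one does a by) accumulates a very large penalty.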
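If you want to check the per-group overhead on your own machine, a rough timing sketch along these lines may help. The table `big`, the group count, and the random values are all assumptions made up for illustration; the actual timings depend on your hardware and data.table version, and the gap may be smaller in later releases than it was when this was written.

```r
library(data.table)

# Hypothetical data: many small groups, so per-group overhead dominates
set.seed(1)
n_groups <- 2000L
big <- data.table(
  id = rep(seq_len(n_groups), each = 5L),
  g  = sample(100L, n_groups * 5L, replace = TRUE)
)

# Version from the question: subsets .SD once per group
t_sd <- system.time(res_sd <- big[, .SD[g == max(g)], by = id])

# Version from the answer: collects row numbers, then subsets once
t_i <- system.time(res_i <- big[big[, .I[g == max(g)], by = id]$V1])

# Show user, system, and elapsed time for each approach
print(rbind(SD = t_sd[1:3], I = t_i[1:3]))

# Both approaches should select the same rows in the same order
same_rows <- identical(res_sd$id, res_i$id) && identical(res_sd$g, res_i$g)
```

A single `system.time` call is noisy; repeating the measurement a few times gives a more trustworthy comparison.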