r - Subset by group with data.table
Assume I have a data table containing some baseball players:
library(plyr)
library(data.table)
bdt <- as.data.table(baseball)
For each player (given by id), I want to find the row corresponding to the year in which they played the most games. This is straightforward in plyr:
ddply(baseball, "id", subset, g == max(g))
What's the equivalent code for data.table?
I tried:
setkey(bdt, "id")
bdt[g == max(g)]             # only 1 row
bdt[g == max(g), by = id]    # Error: 'by' or 'keyby' is supplied but not j
bdt[, .SD[g == max(g)]]      # only 1 row
This works:
bdt[, .SD[g == max(g)], by = id]
But it's only 30% faster than plyr, suggesting it's probably not idiomatic.
Here's the fast data.table way:
bdt[bdt[, .I[g == max(g)], by = id]$V1]
This avoids constructing .SD, which is the bottleneck in your expressions.
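To make the one-liner easier to follow, here is a minimal sketch that splits it into two steps. The table `toy` and its values are hypothetical stand-ins for the baseball data, chosen only so the result is easy to check by hand.

```r
library(data.table)

# Hypothetical toy data standing in for the baseball table
toy <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 50, 30, 7, 7)
)

# Step 1: within each id, .I holds the row numbers of that group in the
# full table; keep only those where g equals the group's maximum.
# The unnamed result column is called V1 by default.
idx <- toy[, .I[g == max(g)], by = id]

# Step 2: use the collected row numbers to subset the original table once.
result <- toy[idx$V1]
print(result)
```

Ties are kept: player "b" has two rows with the same maximum, so both are returned, just as `subset(g == max(g))` would do in the plyr version.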
Edit: Actually, the main reason the OP's version is slow is not just that it has .SD in it, but that it uses it in a particular way: by calling [.data.table, which at the moment has a huge overhead, so running it in a loop (once per group when using by) accumulates a very large penalty.
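A rough way to check this on your own machine is to time both forms on a table with many small groups. The sketch below uses synthetic data (the sizes and the name `dt` are assumptions, not the baseball set), and the gap you see will depend on your data.table version, since the per-group overhead described here may have been reduced in later releases.

```r
library(data.table)

set.seed(1)
n_groups <- 2000L

# Synthetic data: many small groups, so per-group overhead dominates
dt <- data.table(
  id = rep(seq_len(n_groups), each = 10L),
  g  = sample(162L, n_groups * 10L, replace = TRUE)
)

# Form 1: subset .SD inside each group (one [.data.table call per group)
t_sd <- system.time(res_sd <- dt[, .SD[g == max(g)], by = id])

# Form 2: collect row numbers per group, then subset once
t_i <- system.time(res_i <- dt[dt[, .I[g == max(g)], by = id]$V1])

# Elapsed seconds for each form; values vary by machine and version
print(c(SD = t_sd[["elapsed"]], I = t_i[["elapsed"]]))
```

Both forms should return the same rows in the same order, so the comparison is only about speed.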