physics - How to apply Marshall Palmer function at ID level in R? -
i'm analyzing dual-polarization radar data , want add result of marshall palmer relation id-level variable in my data.
there's no cran function can find, r user has script wherein applies relation estimate of expected value in data:
# troy w. (thanks!) # few small changes hack-r ## better in r me clean up/refactor code bit. library(dplyr) library(data.table) test <- fread('../input/test.csv') mpalmer <- function(ref, minutes_past) { # order reflectivity values , minutes_past sort_min_index = order(minutes_past) minutes_past <- minutes_past[sort_min_index] ref <- ref[sort_min_index] # calculate length of time each reflectivity value valid valid_time <- rep(0, length(minutes_past)) valid_time[1] <- minutes_past[1] if (length(valid_time) > 1) { (i in seq(2, length(minutes_past))) { valid_time[i] <- minutes_past[i] - minutes_past[i-1] } valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time) } else { # if 1 observation, make valid entire hour valid_time <- 60 } valid_time = valid_time / 60 # calculate hourly rain rates using marshall-palmer weighted valid times sum <- 0 (i in seq(length(ref))) { if (!is.na(ref[i])) { mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625 sum <- sum + mmperhr * valid_time[i] } } return(sum) } results <- test %>% group_by(id) %>% summarize(expected=sum) write.csv(results, file='sample_solution.csv', row.names=false)
in addition being incredibly slow big data, problem code above doesn't create column of results within original data, allow productionalized in etl pipeline created relation @ id level 1 predictive variable in dataset.
i tried rewriting function this:
mpalmer <- function(ref, minutes_past) { # credit troy # edits jason miller, hack-r.com # order reflectivity values , minutes_past sort_min_index = order(minutes_past) minutes_past <- minutes_past[sort_min_index] ref <- ref[sort_min_index] # calculate length of time each reflectivity value valid valid_time <- rep(0, length(minutes_past)) valid_time[1] <- minutes_past[1] if (length(valid_time) > 1) { (i in seq(2, length(minutes_past))) { valid_time[i] <- minutes_past[i] - minutes_past[i-1] } valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time) } else { # if 1 observation, make valid entire hour valid_time <- 60 } valid_time = valid_time / 60 # calculate hourly rain rates using marshall-palmer weighted valid times sum <- 0 (i in seq(length(ref))) { if (!is.na(ref[i])) { mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625 sum <- sum + mmperhr * valid_time[i] } } return(sum) }
and applying this:
train.samp$mp <- aggregate(train.samp$ref, by=list(train.samp$id), fun = mpalmer, minutes_past = train.samp$minutes_past)
which think works, after running long time, returned error this:
error in `$<-.data.frame`(`*tmp*`, "mp", value = list(group.1 = c(10l, : replacement has 9765 rows, data has 10000
i've tried on different samples of data , error message in format, though specific numbers may change. there's no missing data in dataset.
any idea how fix function (and/or make faster)?
update: i've got working loop so slow...
this i'm going now, i'm still open other solutions.
basically, went original function broke apart overly large dataset manageable chunks , ran loops on each chunk:
train.samp <- train.samp[order(train.samp$id),] train.samp1 <- train.samp1[order(train.samp1$id),] train.samp.id <- unique(train.samp$id) train.samp.id.1 <- train.samp.id[1:25000] train.samp.id.2 <- train.samp.id[25001:50000] train.samp.id.3 <- train.samp.id[50001:75000] train.samp.id.4 <- train.samp.id[75001:100000] train.samp.id.6 <- train.samp.id[100001:125000] train.samp.id.5 <- train.samp.id[125001:150000] train.samp.id.7 <- train.samp.id[150001:175000] train.samp.id.8 <- train.samp.id[175001:200000] train.samp.id.9 <- train.samp.id[200001:length(train.samp.id)] train.samp.1 <- train.samp[train.samp$id %in% train.samp.id.1,] train.samp.2 <- train.samp[train.samp$id %in% train.samp.id.2,] train.samp.3 <- train.samp[train.samp$id %in% train.samp.id.3,] train.samp.4 <- train.samp[train.samp$id %in% train.samp.id.4,] train.samp.5 <- train.samp[train.samp$id %in% train.samp.id.5,] train.samp.6 <- train.samp[train.samp$id %in% train.samp.id.6,] train.samp.7 <- train.samp[train.samp$id %in% train.samp.id.7,] train.samp.8 <- train.samp[train.samp$id %in% train.samp.id.8,] train.samp.9 <- train.samp[train.samp$id %in% train.samp.id.9,] system.time( for(i in unique(train.samp.1$id)){ train.samp.1$mp[train.samp.1$id == i] <- mpalmer(train.samp.1$ref[train.samp.1$id == i], minutes_past = train.samp.1$minutes_past[train.samp.1$id == i]) } ) for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.2])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.3])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.4])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.5])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.6])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.7])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.8])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } for(i in unique(train.samp$id[train.samp$id %in% train.samp.id.9])){ train.samp$mp[train.samp$id == i] <- mpalmer(train.samp$ref[train.samp$id == i], minutes_past = train.samp$minutes_past[train.samp$id == i]) } system.time( for(i in unique(train.samp1$id)){ train.samp1$mp[train.samp1$id == i] <- mpalmer(train.samp1$ref[train.samp1$id == i], minutes_past = train.samp1$minutes_past[train.samp1$id == i]) }
the function not shown here, take advantage of @gregor's suggestion in comment.
Comments
Post a Comment