python - Efficiently compute historical aggregates
I have a DataFrame containing historical records, for example sales:
import pandas as pd

first_salesman = pd.Series([1, 2, 3], index=[pd.to_datetime('2015-01-01'), pd.to_datetime('2015-01-02'), pd.to_datetime('2015-01-03')])
second_salesman = pd.Series([2, 0, 5], index=[pd.to_datetime('2015-01-01'), pd.to_datetime('2015-01-02'), pd.to_datetime('2015-01-03')])
third_salesman = pd.Series([6, 7, 1], index=[pd.to_datetime('2015-01-01'), pd.to_datetime('2015-01-02'), pd.to_datetime('2015-01-03')])

df = pd.concat([first_salesman, second_salesman, third_salesman], axis=1).reset_index()
df = pd.melt(df, id_vars=['index'], value_vars=[0, 1, 2])
df.columns = ['date', 'salesman_id', 'sales']
df = df.set_index('date')
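For reference, after this reshaping df looks like the following (the salesman_id values 0, 1, 2 come from the concatenated column labels):

            salesman_id  sales
date
2015-01-01            0      1
2015-01-02            0      2
2015-01-03            0      3
2015-01-01            1      2
2015-01-02            1      0
2015-01-03            1      5
2015-01-01            2      6
2015-01-02            2      7
2015-01-03            2      1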
I want to compute historical statistics for each of the salesmen: in other terms, for each salesman and each day, I want to know how the mean, std, max, min, etc. looked up until that day.
So, focusing on salesman 1 and the average, the result would be [1, 1.5, 2].
I could build a for-loop, but I have a big number of salesmen and dates, so that doesn't seem like an efficient thing to do. I'm looking for a way of computing this in an efficient manner.
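For context, a minimal sketch of the kind of loop this would avoid (assuming the df built above; it re-aggregates the full history on every day, which scales poorly):

# Naive approach: for each salesman and each day, recompute the statistic
# over all sales up to and including that day.
historical_means = {}
for salesman_id, group in df.groupby('salesman_id'):
    means = []
    for day in range(1, len(group) + 1):
        means.append(group['sales'].iloc[:day].mean())  # mean of the history so far
    historical_means[salesman_id] = means
# historical_means[0] == [1.0, 1.5, 2.0]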
After some tinkering, I found a solution. In pandas >= 0.18, you can use df.expanding(). See the documentation.
A clean solution for this scenario is essentially a one-liner: group by salesman, then aggregate over an expanding window:
import numpy as np

df.groupby('salesman_id')['sales'].expanding(min_periods=1).agg([np.mean, np.std, np.min, np.max])
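As a quick check (result is just a hypothetical variable name), the expanding mean for the first salesman matches the expected [1, 1.5, 2]:

result = df.groupby('salesman_id')['sales'].expanding(min_periods=1).agg([np.mean, np.std, np.min, np.max])
print(result.loc[0, 'mean'].tolist())  # [1.0, 1.5, 2.0]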
Note that .expanding() supports .agg(), so you can use whatever aggregates you'd like.
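For instance, string names work as well, and .apply() covers fully custom reductions; the peak-to-peak example below is just an illustration:

# Built-in aggregates by name:
df.groupby('salesman_id')['sales'].expanding().agg(['median', 'sum'])

# A fully custom reduction over each expanding window:
df.groupby('salesman_id')['sales'].expanding().apply(lambda window: window.max() - window.min(), raw=True)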