python - Multiple Count and Median Values from a Dataframe
I am trying to perform several operations in one program at the same time. I have a data frame with dates whose start and end I do not know, and I want to find:
- the total number of days the data set has
- the total number of hours
- the median of count
- a separate output with the median per day/date
- if possible, the median-of-median, in the simplest way possible
Input: a few rows from a large file (GB in size)
2004-01-05,16:00:00,17:00:00,mon,10766,656
2004-01-05,17:00:00,18:00:00,mon,12223,670
2004-01-05,18:00:00,19:00:00,mon,12646,710
2004-01-05,19:00:00,20:00:00,mon,19269,778
2004-01-05,20:00:00,21:00:00,mon,20504,792
2004-01-05,21:00:00,22:00:00,mon,16553,783
2004-01-05,22:00:00,23:00:00,mon,18944,790
2004-01-05,23:00:00,00:00:00,mon,17534,750
2004-01-06,00:00:00,01:00:00,tue,17262,747
2004-01-06,01:00:00,02:00:00,tue,19072,777
2004-01-06,02:00:00,03:00:00,tue,18275,785
2004-01-06,03:00:00,04:00:00,tue,13589,757
2004-01-06,04:00:00,05:00:00,tue,16053,735
The start and end dates are not known.
Edit: expected output 1, a single row of results:
days,hours,median,median-of-median
2,17262,13,17398
The median-of-median is the median of the median column in output 2.
Expected output 2, the median for every date, which is used to find the median-of-median:
date,median
2004-01-05,17534
2004-01-06,17262
Code:
import pandas as pd
from datetime import datetime

df = pd.read_csv('one_hour.csv')
df.columns = ['date', 'starttime', 'endtime', 'day', 'count', 'unique']

date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['starttime'])
med_med = df.groupby(['date','count']).median()

print date_count
print all_median
print all_hours

stats = ['date_count', 'all_median', 'all_hours', 'median-of-median']
stats.to_csv('stats_all.csv', index=False)
med_med.to_csv('med_day.csv', index=False, header=False)
Obviously, the code does not give the result it is supposed to.
The error is shown below.
Error:
Traceback (most recent call last):
  File "day_median.py", line 8, in <module>
    all_median = df.median(['count'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 5310, in stat_func
    numeric_only=numeric_only)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4760, in _reduce
    axis = self._get_axis_number(axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 308, in _get_axis_number
    axis = self._axis_aliases.get(axis, axis)
TypeError: unhashable type: 'list'
IIUC, you may need to change:
date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['starttime'])
to:
date_count = df['date'].count()
all_median = df['count'].median()
all_hours = df['starttime'].count()

print(date_count)
print(all_median)
print(all_hours)

13
17262.0
13
Use this if you need the statistics for the columns date, count and starttime.
Edit by comment:
If you need to count the unique values of a column, use nunique:
date_count = df['date'].nunique()
print(date_count)

2
DataFrame stats:
cols = ['date_count', 'all_median', 'all_hours']
stats = pd.DataFrame([[date_count, all_median, all_hours]], columns=cols)
print(stats)

   date_count  all_median  all_hours
0           2     17262.0         13
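The question also asks for the median per date and the median of those medians, which the snippets above do not cover. A minimal sketch of one way to get them follows. It assumes the file has no header row, that each row is one hour, and that the question's column names apply; the inline sample and the names daily_median and median_of_medians are illustrative only.

```python
import io
import pandas as pd

# Sample rows copied from the question; for the real file, pass the path
# (e.g. 'one_hour.csv') to pd.read_csv instead of the StringIO buffer.
raw = """2004-01-05,16:00:00,17:00:00,mon,10766,656
2004-01-05,17:00:00,18:00:00,mon,12223,670
2004-01-05,18:00:00,19:00:00,mon,12646,710
2004-01-05,19:00:00,20:00:00,mon,19269,778
2004-01-05,20:00:00,21:00:00,mon,20504,792
2004-01-05,21:00:00,22:00:00,mon,16553,783
2004-01-05,22:00:00,23:00:00,mon,18944,790
2004-01-05,23:00:00,00:00:00,mon,17534,750
2004-01-06,00:00:00,01:00:00,tue,17262,747
2004-01-06,01:00:00,02:00:00,tue,19072,777
2004-01-06,02:00:00,03:00:00,tue,18275,785
2004-01-06,03:00:00,04:00:00,tue,13589,757
2004-01-06,04:00:00,05:00:00,tue,16053,735
"""

# Assumption: no header row in the data, so the names are supplied here.
cols = ['date', 'starttime', 'endtime', 'day', 'count', 'unique']
df = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# Median of 'count' for each date (one value per date).
daily_median = df.groupby('date')['count'].median()

# Median of the per-date medians.
median_of_medians = daily_median.median()

# One-row summary; 'hours' is the row count, assuming one row per hour.
stats = pd.DataFrame(
    [[df['date'].nunique(), len(df), df['count'].median(), median_of_medians]],
    columns=['days', 'hours', 'median', 'median-of-median'],
)

print(daily_median)
print(stats)

# To write the two outputs the question asks for:
# stats.to_csv('stats_all.csv', index=False)
# daily_median.reset_index().to_csv('med_day.csv', index=False)
```

On these 13 sample rows the per-date medians come out as 17043.5 and 17262.0, and their median as 17152.75. The figures in the question's expected output presumably come from the full file rather than from this excerpt.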