Hive - Combining data based on column in Spark
I have data in the following format in a Hive table:
user | purchase | time_of_purchase
I want the data in this format:
user | list of purchases ordered by time
How do I do this in PySpark or HiveQL?
I have tried using collect_list in Hive, but it does not retain the order by timestamp correctly.
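Roughly the kind of query I tried looks like this (a sketch only; the table name purchaseList and the column names follow the layout above, run through a HiveContext). collect_list does build one array of purchases per user, but nothing ties the order of that array to time_of_purchase:
# Sketch of the collect_list attempt; assumes a Hive table named purchaseList
# with the columns user, purchase and time_of_purchase shown above.
from pyspark.sql import HiveContext
attempt = HiveContext(sc).sql(
    "SELECT `user`, collect_list(purchase) AS purchases "
    "FROM purchaseList GROUP BY `user`"
)
# The order of items inside each collect_list array is not guaranteed to follow time_of_purchase.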
Edit: adding sample data as asked by KartikKannapur. Here is the sample data:
94438fef-c503-4326-9562-230e78796f16 | bread | Jul 7 20:48
94438fef-c503-4326-9562-230e78796f16 | shaving cream | July 10 14:20
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | milk | July 7 3:48
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | bread | July 7 3:49
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | lotion | July 7 15:30
The output I want is:
94438fef-c503-4326-9562-230e78796f16 | bread, shaving cream
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | milk, bread, lotion
One way of doing this:
First create a HiveContext and read the table into an RDD.
from pyspark.sql import HiveContext
purchaseList = HiveContext(sc).sql('FROM purchaseList SELECT *')
Then process the RDD:
from datetime import datetime as dt
# parse the timestamp and key each record by user
purchaseList = purchaseList.map(lambda x: (x[0], [x[1], dt.strptime(x[2], "%b %d %H:%M")]))
# group purchases by user, then sort each user's purchases by time and keep only the item names
purchaseByUser = purchaseList.groupByKey()
purchaseByUser = purchaseByUser.map(lambda x: (x[0], [y[0] for y in sorted(x[1], key=lambda z: z[1])]))
print(purchaseByUser.take(2))
The output:
[('94438fef-c503-4326-9562-230e78796f16', ['bread', 'shaving cream']), ('a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad', ['milk', 'bread', 'lotion'])]
Save the RDD to a new Hive table:
schema_rdd = HiveContext(sc).inferSchema(purchaseByUser)
schema_rdd.saveAsTable('purchaseByUser')
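Note that inferSchema was deprecated in favor of createDataFrame from Spark 1.3 on; on a newer release, a rough equivalent would look like the sketch below (my own sketch, not part of the original code; the column names 'user' and 'purchases' are my choice):
# Hedged alternative for newer Spark versions: build a DataFrame from the
# (user, [purchases]) pairs and write it as a Hive table.
hc = HiveContext(sc)
df = hc.createDataFrame(purchaseByUser, ['user', 'purchases'])
df.write.saveAsTable('purchaseByUser')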
For reading and writing Hive tables, see this Stack Overflow question and the Spark docs.