hive - Combining data based on column in spark -


i have data in following format in hive table.

user |  purchase | time_of_purchase 

i want data in

user | list of purchases ordered time 

how do in pyspark or hiveql?

i have tried using collect_list in hive not retain order correctly timestamp.

edit : adding sample data asked kartikkannapur. here sample data

94438fef-c503-4326-9562-230e78796f16 | bread | jul 7 20:48 94438fef-c503-4326-9562-230e78796f16 | shaving cream | july 10 14:20 a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | milk | july 7 3:48 a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | bread | july 7 3:49 a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | lotion | july 7 15:30 

the output want

94438fef-c503-4326-9562-230e78796f16 | bread , shaving cream a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | milk , bread , lotion 

one way of doing is

first create hive context , read table rdd.

from pyspark import hivecontext purchaselist = hivecontext(sc).sql('from purchaselist select *') 

then process rdd

from datetime import datetime dt purchaselist = purchaselist.map(lambda x:(x[0],[x[1],dt.strptime(x[2],"%b %d %h:%m")])) purchasebyuser = purchaselist.groupbykey() purchasebyuser = purchasebyuser.map(lambda x:(x[0],[y[0] y in sorted(x[1], key=lambda z:z[1])])) print(purchasebyuser.take(2)) 

output

[('94438fef-c503-4326-9562-230e78796f16', ['bread', 'shaving cream']), ('a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad', ['milk', 'bread', 'lotion'])] 

save rdd new hive table

schema_rdd = hivecontext(sc).inferschema(purchasebyuser) schema_rdd.saveastable('purchasebyuser') 

for reading , writing hive table see stackoverflow question , spark docs


Comments

Popular posts from this blog

javascript - Slick Slider width recalculation -

jsf - PrimeFaces Datatable - What is f:facet actually doing? -

angular2 services - Angular 2 RC 4 Http post not firing -