hadoop - Is there a better way of handling static columns than grouping in Pig?


I have a lot of denormalized data that I need to run calculations on. There are 28 columns: 1 of them is an id column, 5 of them need a sum, and the rest need to be reported. 22 of these columns are the same for a single id. I'm grouping on 23 columns and summing 5. It seems to me that this has undue overhead. Is there a better way to handle it?

Here's the script after the initial load:

-- group on the id plus every static "reported" column
grouped = GROUP inputdata BY (
    site_id_col, meta_id_col, item_id_col, seller_id_col,
    category1_col, category2_col, total_watch_col, item_title_col,
    auct_type_col, currency_col, item_price_col, shipping_type_col,
    shipping_fee_col, start_date_col, total_qty_col, qty_avail_col,
    status_id_col, auct_duration_col, end_date_col,
    login_atol_col, login_latest_col);

-- one output row per group: the group key fields followed by the sums
filtered = FOREACH grouped GENERATE
    FLATTEN(group),
    SUM(inputdata.impression_col),
    SUM(inputdata.click_col),
    SUM(inputdata.bidcount_col),
    SUM(inputdata.qty_sold_col),
    SUM(inputdata.ck_trans_col),
    SUM(inputdata.gmv_col);

STORE filtered INTO 'output/';

So, whether or not this is faster depends on your data set and your cluster, but you can try regenerating the data as just the id and the 5 summed columns, and then joining onto the 22 "reported" columns afterwards. Something like:

-- narrow relation: just the id and the columns to be summed
smallerdata = FOREACH inputdata GENERATE item_id_col, impression_col, ...;

-- one row per id carrying the static "reported" columns
reportingdata = FOREACH inputdata GENERATE item_id_col, [other 22 reporting cols];
reportingdata1 = DISTINCT reportingdata;

grouped = GROUP smallerdata BY item_id_col;

-- the bag inside each group is named after the grouped relation, smallerdata
filtered = FOREACH grouped GENERATE
    FLATTEN(group) AS id,
    SUM(smallerdata.impression_col),
    SUM(smallerdata.click_col),
    SUM(smallerdata.bidcount_col),
    SUM(smallerdata.qty_sold_col),
    SUM(smallerdata.ck_trans_col),
    SUM(smallerdata.gmv_col);

joined = JOIN filtered BY id, reportingdata1 BY item_id_col;

STORE joined INTO 'output/';

If the grouped set has way fewer rows than the input set, this will make things faster. It'll also go a long way towards preventing you from running into Java heap space issues, which Pig is known to have when you wind up with wide rows (i.e. after groups).

If that's not the case, the original way is probably faster, because joins are not something you want to use in Pig unless you have to.
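To convince yourself that the two scripts produce the same totals before benchmarking them, here is a minimal sketch in plain Python (not Pig) that models both approaches on a tiny in-memory data set. The rows, the column layout, and the function names are all made up for illustration; only the logic (group on everything vs. group on the id and join onto the distinct static columns) mirrors the scripts above.

```python
from collections import defaultdict

# Hypothetical toy rows: (item_id, seller_id, item_title, impressions, clicks)
ROWS = [
    (1, "s1", "lamp", 10, 1),
    (1, "s1", "lamp", 5, 2),
    (2, "s2", "desk", 7, 0),
    (1, "s1", "lamp", 3, 4),
]

N_STATIC = 3   # item_id plus two static "reported" columns (assumed layout)
N_METRICS = 2  # columns to sum


def group_on_all_static(rows):
    """Original approach: group on every static column, sum the rest."""
    sums = defaultdict(lambda: [0] * N_METRICS)
    for row in rows:
        key = row[:N_STATIC]
        for i, value in enumerate(row[N_STATIC:]):
            sums[key][i] += value
    return sorted(key + tuple(total) for key, total in sums.items())


def group_on_id_then_join(rows):
    """Alternative: sum per id, then join onto the distinct static columns."""
    sums = defaultdict(lambda: [0] * N_METRICS)
    for row in rows:
        item_id = row[0]
        for i, value in enumerate(row[N_STATIC:]):
            sums[item_id][i] += value
    reporting = {row[:N_STATIC] for row in rows}  # plays the role of DISTINCT
    # join on the id; columns are reordered here to match the first function
    return sorted(static + tuple(sums[static[0]]) for static in reporting)


if __name__ == "__main__":
    print(group_on_all_static(ROWS))
    print(group_on_id_then_join(ROWS))
```

Note that the two results only match when the "reported" columns really are constant for each id. If one of them varies, the distinct step keeps several rows for that id and the join repeats the full per-id totals on each of them, so it's worth checking that assumption on your data first.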

