Query Optimization Ideas for Pig

Already Implemented

  • pipeline a sequence of stateless operators into a single Map or single Reduce
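As a rough illustration of the idea (not Pig's actual execution code), the sketch below fuses a chain of stateless per-record operators into one function that a single map task could run. The helper names `filter_op`, `foreach_op`, and `pipeline`, and the record fields, are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Operator = Callable[[Iterator[Record]], Iterator[Record]]


def filter_op(pred: Callable[[Record], bool]) -> Operator:
    """Stateless FILTER: keeps records for which pred is true."""
    def op(records: Iterator[Record]) -> Iterator[Record]:
        return (r for r in records if pred(r))
    return op


def foreach_op(fn: Callable[[Record], Record]) -> Operator:
    """Stateless FOREACH: transforms each record independently."""
    def op(records: Iterator[Record]) -> Iterator[Record]:
        return (fn(r) for r in records)
    return op


def pipeline(*ops: Operator) -> Callable[[Iterable[Record]], Iterator[Record]]:
    """Chain operators lazily so each record flows through all of them
    in one pass, with no materialized intermediate data set."""
    def fused(records: Iterable[Record]) -> Iterator[Record]:
        stream = iter(records)
        for op in ops:
            stream = op(stream)
        return stream
    return fused


# Hypothetical map task: drop empty requests, then project and convert units.
map_task = pipeline(
    filter_op(lambda r: r["bytes"] > 0),
    foreach_op(lambda r: {"user": r["user"], "kb": r["bytes"] / 1024}),
)
```

Because every operator here looks at one record at a time, the fused function can run entirely inside a single Map (or a single Reduce) without an intermediate job boundary.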

Implemented, but room for improvement

  • push distributive/algebraic functions into combiner, including algebraic UDFs, DISTINCT, and other items
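To make the combiner idea concrete, here is a minimal sketch of an algebraic function (average) split into three stages so partial results can be merged on the map side before the shuffle. The stage names `initial`, `intermediate`, and `final`, and the in-memory simulation in `run_with_combiner`, are assumptions for illustration and do not reproduce Pig's interfaces.

```python
from collections import defaultdict


def initial(value):
    """Turn one raw value into a partial aggregate (sum, count)."""
    return (value, 1)


def intermediate(partials):
    """Merge partial aggregates; safe to apply any number of times."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def final(partials):
    """Produce the average from the merged partial aggregates."""
    total, count = intermediate(partials)
    return total / count


def run_with_combiner(map_outputs):
    """map_outputs: one list of (key, value) pairs per simulated map task."""
    shuffled = defaultdict(list)
    for task_output in map_outputs:
        # Combiner step: aggregate locally within each map task.
        local = defaultdict(list)
        for key, value in task_output:
            local[key].append(initial(value))
        # Only one partial per key per task crosses the "network".
        for key, partials in local.items():
            shuffled[key].append(intermediate(partials))
    # Reduce step: merge the per-task partials and finish.
    return {key: final(partials) for key, partials in shuffled.items()}
```

The average itself is not distributive, but the (sum, count) pair is, which is what lets the work be pushed into the combiner.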

Low hanging fruit

  • System-R optimizer heuristics:
    • push projections (move them earlier in the plan)
    • push cheap filters (move filters known to be cheap, e.g. ones with simple logical predicates, earlier in the plan)
    • eliminate cartesian products when possible, e.g. convert CROSS followed by FILTER into JOIN
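The CROSS-followed-by-FILTER rewrite can be sketched as follows, assuming the filter is an equality test between one field of each input. Both functions are hypothetical in-memory stand-ins, not Pig operators; the point is only that they return the same pairs while the join avoids enumerating the full cartesian product.

```python
from collections import defaultdict
from itertools import product


def cross_then_filter(left, right, left_key, right_key):
    """Naive plan: build every (l, r) pair, then keep the matching ones."""
    return [
        (l, r)
        for l, r in product(left, right)
        if l[left_key] == r[right_key]
    ]


def hash_join(left, right, left_key, right_key):
    """Rewritten plan: index one side by key, then probe with the other."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]
```

The naive plan examines len(left) * len(right) pairs, whereas the join only touches pairs that actually share a key.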

Fruit that can be reached with a short ladder

  • look for ways to do multiple group/cogroup/join operations in a single map-reduce job; this is possible when the keys share a common prefix. Example: group by userid+hour, then count, then group by userid, then take max. This can be done in one map-reduce job with userid as the reduce key.
  • choose a join strategy (symmetric hashing, fragment-and-replicate, ...); a reasonable choice can probably be made based on file sizes [but first, the various join strategies have to be implemented in the execution layer; currently Pig only supports symmetric hashing]
  • choose a CROSS strategy: (1) n-by-m grid, as currently implemented; or (2) n-by-1, which is better when crossing a large table with a very small table and can be implemented as map-only [a common example of crossing a large table with a small one is normalizing all records of Table A by dividing by a scalar S, which can be done by crossing Table A with a Table B containing a single record that holds S]
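The userid+hour example above can be sketched in plain Python to show why one shuffle suffices. `two_job_plan` mimics the straightforward two-job plan, and `one_job_plan` partitions on userid only and does both aggregation levels inside the simulated reducer. Both function names and the (userid, hour) event tuples are assumptions for illustration.

```python
from collections import Counter, defaultdict


def two_job_plan(events):
    """Job 1: group by (userid, hour) and count.
    Job 2: group by userid and take the max count."""
    counts = Counter((user, hour) for user, hour in events)
    best = {}
    for (user, _hour), n in counts.items():
        best[user] = max(best.get(user, 0), n)
    return best


def one_job_plan(events):
    """Single job: shuffle on userid, the common prefix of both keys.
    Each reducer sees all hours for its user, so it can count per hour
    and take the max without a second shuffle."""
    by_user = defaultdict(list)
    for user, hour in events:
        by_user[user].append(hour)
    return {
        user: max(Counter(hours).values())
        for user, hours in by_user.items()
    }
```

The merge works because every (userid, hour) group is wholly contained in one userid partition, so the inner grouping never needs data from another reducer.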

Probably won't get there, and may not even want to go there

  • query optimization techniques found in any database textbook
    • reordering filters (needs selectivity estimates from histograms, built a priori or on the fly, or else adaptive reordering)
    • ...

PigOptimizationWishList (last edited 2009-09-20 23:38:14 by localhost)