Query Optimization Ideas for Pig
Already Implemented
pipeline a sequence of stateless operators into a single Map or single Reduce
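As a rough illustration of what pipelining means here, the sketch below fuses two stateless per-record operators into one function that makes a single pass over the input, the way a fused chain would run inside one map task. This is Python, not Pig code, and the helper names (filter_op, foreach_op, pipeline) are invented for the example.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]
Operator = Callable[[Iterable[Record]], Iterator[Record]]

def filter_op(pred: Callable[[Record], bool]) -> Operator:
    """Stateless FILTER: keeps records for which pred is true."""
    def run(records: Iterable[Record]) -> Iterator[Record]:
        return (r for r in records if pred(r))
    return run

def foreach_op(fn: Callable[[Record], Record]) -> Operator:
    """Stateless FOREACH: transforms each record independently."""
    def run(records: Iterable[Record]) -> Iterator[Record]:
        return (fn(r) for r in records)
    return run

def pipeline(*ops: Operator) -> Operator:
    """Chain operators lazily so records stream through all of them in one pass."""
    def run(records: Iterable[Record]) -> Iterator[Record]:
        for op in ops:
            records = op(records)
        return iter(records)
    return run

# Hypothetical plan: FILTER followed by FOREACH, executed as one map phase.
map_phase = pipeline(
    filter_op(lambda r: r["clicks"] > 0),
    foreach_op(lambda r: {"user": r["user"], "clicks": r["clicks"] * 2}),
)

rows = [{"user": "a", "clicks": 0}, {"user": "b", "clicks": 3}]
result = list(map_phase(rows))
```

Because the operators are generators, no intermediate collection is materialized between them, which is the point of running them in the same task.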
Implemented, but room for improvement
push distributive/algebraic functions into combiner, including algebraic UDFs, DISTINCT, and other items
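To make the combiner idea concrete, here is a minimal sketch of an algebraic function (average) split into three steps: one applied to each input value, one that merges partial results (safe to run in a combiner any number of times), and one that produces the final answer in the reducer. The function names are illustrative assumptions and do not reproduce Pig's UDF interfaces.

```python
from functools import reduce
from typing import List, Tuple

Partial = Tuple[float, int]  # (running sum, running count)

def avg_initial(value: float) -> Partial:
    """Turn one input value into a partial result."""
    return (value, 1)

def avg_combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial results; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def avg_final(partial: Partial) -> float:
    """Produce the final average from the merged partial result."""
    total, count = partial
    return total / count

# Values for a single key, as seen by three hypothetical map tasks.
map_outputs: List[List[float]] = [[4.0, 8.0], [6.0], [2.0, 10.0]]

# Combiner: each map task ships one (sum, count) pair instead of all its values.
combined = [reduce(avg_combine, map(avg_initial, vals)) for vals in map_outputs]

# Reducer: merge the per-task partials, then finalize.
avg = avg_final(reduce(avg_combine, combined))
```

Averaging the per-task averages directly would give the wrong answer when tasks see different numbers of values, which is why the partial result carries both the sum and the count.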
Low hanging fruit
System-R optimizer heuristics:
push projections (move them earlier in the plan)
push cheap filters (move filters known to be cheap, e.g. ones with simple logical predicates, earlier in the plan)
eliminate cartesian products when possible, e.g. convert CROSS followed by FILTER into JOIN
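The last rewrite in this list can be checked on a toy example. The sketch below, with made-up tables and function names, compares CROSS followed by an equality FILTER against a hash-based equi-join. Both return the same pairs, but the first enumerates every combination of rows while the second only touches rows with matching keys.

```python
from collections import defaultdict
from itertools import product
from typing import List, Tuple

Row = Tuple[int, str]  # (key, payload)

users: List[Row] = [(1, "ann"), (2, "bob"), (3, "cy")]
clicks: List[Row] = [(1, "/home"), (3, "/cart"), (3, "/pay"), (4, "/x")]

def cross_then_filter(left: List[Row], right: List[Row]) -> List[Tuple[Row, Row]]:
    """CROSS then FILTER: examines len(left) * len(right) pairs."""
    return [(l, r) for l, r in product(left, right) if l[0] == r[0]]

def hash_join(left: List[Row], right: List[Row]) -> List[Tuple[Row, Row]]:
    """Equi-join: index one side by key, then probe with the other."""
    index = defaultdict(list)
    for r in right:
        index[r[0]].append(r)
    return [(l, r) for l in left for r in index.get(l[0], [])]

via_cross = sorted(cross_then_filter(users, clicks))
via_join = sorted(hash_join(users, clicks))
```

The rewrite only applies when the filter predicate contains an equality between fields of the two inputs; other predicates still need the cross product or a different strategy.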
Fruit that can be reached with a short ladder
look for ways to do multiple group/cogroup/join operations in a single map-reduce job; this is possible when the keys share a common prefix. Example: group by userid+hour, then count, then group by userid, then take max. This can be done in one map-reduce job with userid as the reduce key.
choose a join strategy (symmetric hashing, fragment-and-replicate, ...); a reasonable choice can probably be made based on file sizes [but first, we have to implement the various join strategies in the execution layer; currently Pig only supports symmetric hashing]
choose a CROSS strategy: (1) n-by-m grid, as currently implemented; or (2) n-by-1, which is better when crossing a large table with a very small table and can be implemented as map-only [a common example of crossing a large table with a small one is normalizing all records of Table A by dividing by a scalar S: you can do this by crossing Table A with a Table B containing a single record that holds S]
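The userid+hour example from the first item in this list can be sketched as a single simulated map-reduce job. Since userid is a prefix of the inner grouping key, shuffling on userid alone brings every (userid, hour) group to the same reducer, which can then do the inner count and the outer max locally. The data and function names below are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input: one (userid, hour) record per event.
events: List[Tuple[str, int]] = [
    ("u1", 9), ("u1", 9), ("u1", 10),
    ("u2", 9), ("u2", 11), ("u2", 11), ("u2", 11),
]

def map_phase(records: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Emit userid as the reduce key and hour as the value."""
    for userid, hour in records:
        yield userid, hour

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Stand-in for the framework's shuffle: collect values per key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(userid: str, hours: List[int]) -> Tuple[str, int]:
    """Inner GROUP BY (userid, hour) + COUNT, then outer GROUP BY userid + MAX."""
    per_hour = Counter(hours)
    return userid, max(per_hour.values())

busiest = dict(reduce_phase(u, hs) for u, hs in shuffle(map_phase(events)).items())
```

Here the reducer holds one counter per hour for a single user, so memory stays small; with a high-cardinality inner key, a real implementation would more likely sort on the full key within each reduce group.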
Probably won't get there, and may not even want to go there
query optimization techniques found in any database textbook
reordering filters (need to estimate selectivity based on a priori or on-the-fly histograms, or maybe adaptively reorder)
...
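For the filter reordering item, a minimal sketch of the idea follows. It assumes all predicates cost the same to evaluate and estimates selectivity from a sample rather than from histograms; the predicates, sample, and function names are all illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Pred = Callable[[int], bool]

def estimate_selectivity(pred: Pred, sample: Sequence[int]) -> float:
    """Fraction of sampled records that pass the predicate."""
    return sum(1 for r in sample if pred(r)) / len(sample)

def order_filters(preds: List[Pred], sample: Sequence[int]) -> List[Pred]:
    """Run the most selective predicate (lowest pass rate) first."""
    return sorted(preds, key=lambda p: estimate_selectivity(p, sample))

def apply_filters(preds: List[Pred], records: Sequence[int]) -> Tuple[List[int], int]:
    """Apply predicates with short-circuiting; also count evaluations."""
    evaluations = 0
    kept = []
    for r in records:
        for p in preds:
            evaluations += 1
            if not p(r):
                break
        else:
            kept.append(r)
    return kept, evaluations

records = list(range(1000))
is_even: Pred = lambda r: r % 2 == 0    # passes about half the records
is_rare: Pred = lambda r: r % 100 == 0  # passes about one record in a hundred

sample = records[::7]  # deterministic sample, every 7th record
ordered = order_filters([is_even, is_rare], sample)

kept_original, cost_original = apply_filters([is_even, is_rare], records)
kept_reordered, cost_reordered = apply_filters(ordered, records)
```

Both orders keep the same records, but the reordered plan evaluates fewer predicates overall. With unequal predicate costs, the ordering would need to weigh cost against selectivity instead of sorting on selectivity alone.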