Stagification

Stagification breaks the physical plan into multiple stages, with stage boundaries drawn between the Local Rearrange and Global Rearrange operators. When there are two or more stages, they are aggregated into MR jobs by taking stages two at a time; a remaining stage at the end is executed either as a map-only job or as another map-reduce job. The MR jobs thus formed are used to create a JobControl object with the relevant dependencies (a sketch of this pairing step follows the figure below). Following is an example:

A = load 'a';
B = foreach A generate $0, $2;
C = filter B by $1 < 10;
D = filter B by $1 > 10;
E = group C by $0, D by $0;
F = foreach E generate group, COUNT(C), COUNT(D);

[Figure: stage.png -- stage boundaries for the example plan]
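
The pairing of stages into MR jobs described above can be sketched as follows. This is a minimal illustration, not Pig code: the Stage and MRJob types are hypothetical placeholders, and the leftover stage is shown as a map-only job (one of the two options mentioned above).

import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the stage-pairing step described above.
 * Stage and MRJob are hypothetical placeholders, not actual Pig classes.
 */
public class StagePairingSketch {

    static class Stage {
        final String name;
        Stage(String name) { this.name = name; }
    }

    /** An MR job built from a map stage and, optionally, a reduce stage. */
    static class MRJob {
        final Stage mapStage;
        final Stage reduceStage;  // null means a map-only job

        MRJob(Stage mapStage, Stage reduceStage) {
            this.mapStage = mapStage;
            this.reduceStage = reduceStage;
        }

        @Override
        public String toString() {
            return reduceStage == null
                    ? "map-only(" + mapStage.name + ")"
                    : "map-reduce(" + mapStage.name + ", " + reduceStage.name + ")";
        }
    }

    /** Take stages two at a time; a leftover stage becomes a map-only job here. */
    static List<MRJob> pairStages(List<Stage> stages) {
        List<MRJob> jobs = new ArrayList<>();
        for (int i = 0; i + 1 < stages.size(); i += 2) {
            jobs.add(new MRJob(stages.get(i), stages.get(i + 1)));
        }
        if (stages.size() % 2 == 1) {
            jobs.add(new MRJob(stages.get(stages.size() - 1), null));
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<Stage> stages = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            stages.add(new Stage("stage" + i));
        }
        // Three stages -> one map-reduce job followed by one map-only job.
        pairStages(stages).forEach(System.out::println);
    }
}

In the real flow, each such job would be wrapped in an org.apache.hadoop.mapred.jobcontrol.Job, the dependencies expressed (e.g. with addDependingJob()), and the whole set submitted through a JobControl instance, as described above.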

Infrastructure

This involves the actual Map and Reduce classes that will run on Hadoop. There will be a static Map class and a static Reduce class, each of which can execute a portion of the physical plan; the plan to be executed is passed in through the job conf. The infrastructure also adds some new operators to enable processing: Start Map, End Map, Start Reduce, End Reduce, and Splits and Split Readers. A map simply calls (End Map).next() until there are no more values to return, adding each returned value to the collector. The Start Map operator's next method is configured by the infrastructure to return the key and value passed to the map function; the Reduce phase works similarly with Start Reduce and End Reduce. Splits and Split Readers are added wherever the plan branches.

The Pig Latin example above would translate as follows:

[Figure: infra.png -- the example plan with the infrastructure operators added]
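
To make the map-side contract above concrete, here is a rough sketch of such a static Map class against the old org.apache.hadoop.mapred API. The PigMapSketch, PlanOp and StartMapOp names are hypothetical, and configure() wires up a trivial identity plan instead of deserializing a real physical plan from the JobConf; the sketch illustrates only the loop that pulls tuples from End Map and adds them to the collector.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PigMapSketch extends MapReduceBase
        implements Mapper<WritableComparable, Writable, Text, Text> {

    /** Hypothetical operator contract: next() returns null when exhausted. */
    interface PlanOp {
        Text[] next() throws IOException;
    }

    /** Start Map: its next() hands back the (key, value) given to map(), once. */
    static class StartMapOp implements PlanOp {
        private WritableComparable key;
        private Writable value;
        private boolean consumed = true;

        void attach(WritableComparable k, Writable v) {
            key = k;
            value = v;
            consumed = false;
        }

        public Text[] next() {
            if (consumed) {
                return null;
            }
            consumed = true;
            return new Text[] { new Text(key.toString()), new Text(value.toString()) };
        }
    }

    private StartMapOp startMap;
    private PlanOp endMap;  // End Map: root of the map-side portion of the plan

    @Override
    public void configure(JobConf conf) {
        // The real infrastructure would deserialize the map-side portion of the
        // physical plan from the JobConf here. This sketch wires up a trivial
        // identity plan in which End Map is Start Map itself.
        startMap = new StartMapOp();
        endMap = startMap;
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Start Map's next() is set up to return the (key, value) passed to map().
        startMap.attach(key, value);

        // Call (End Map).next() until the plan has nothing left for this record,
        // adding each returned tuple to the collector.
        Text[] tuple;
        while ((tuple = endMap.next()) != null) {
            out.collect(tuple[0], tuple[1]);
        }
    }
}

A static Reduce class would follow the same pattern: Start Reduce hands the key and the iterator of values to the plan, and the reduce method drains (End Reduce).next() into the collector.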
