The new LOProject works in two different ways:-
- Given 1 index, it outputs a datum.
- Given 2 or more indexes, it outputs a tuple.
Besides that, it can be marked as a sentinel, meaning it bridges data from the outer plan to the inner plan.
Doesn't that seem like too many meanings for one operator?
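To make the overloading concrete, here is a toy model of the two output shapes described above. It is only a sketch: `lo_project` and the sample row are hypothetical names for illustration, not Pig's actual implementation.

```python
from typing import Any, Sequence


def lo_project(row: Sequence[Any], indexes: Sequence[int]) -> Any:
    """Toy stand-in for the current LOProject behavior (hypothetical)."""
    if len(indexes) == 1:
        # One index: the result is a bare datum.
        return row[indexes[0]]
    # Two or more indexes: the result is a tuple.
    return tuple(row[i] for i in indexes)


row = ("apache.org", 42, "x")
single = lo_project(row, [1])     # a datum
multi = lo_project(row, [0, 2])   # a tuple
```

The caller cannot know the result type without also knowing how many indexes were passed, which is the ambiguity in question.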
Example
B = COGroup A BY $0, B BY $1 ;
C = FOREACH B GENERATE flatten(A.(f1, f2)), group ;
Here are the inner plans (inside GENERATE):-
(plan1)                  (plan2)
Project(A.(f1, f2))      Project(group)
The one in the first plan returns a projected bag, but the one in the second plan returns a datum. Both of them also act as bridges between the outer and inner plans.
My suggestion
It would be cleaner and more understandable if we just:-
- Introduce LOSentinel, which is used to get 1 field out of the outer plan (from a tuple or bag).
- Use LOProject only when projecting tuples or bags (and always output a tuple/bag).
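A minimal sketch of how the two responsibilities might separate, under stated assumptions: the `Sentinel` and `Project` classes, their `eval` methods, and the dict/list encoding of the outer row and bags are all hypothetical and exist only for illustration. They are not the LOSentinel/LOProject API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Sentinel:
    """Hypothetical LOSentinel: pulls exactly one field out of the outer plan."""
    field: str

    def eval(self, outer: Dict[str, Any]) -> Any:
        return outer[self.field]


@dataclass
class Project:
    """Hypothetical narrowed LOProject: tuple in, tuple out; bag in, bag out."""
    columns: List[int]

    def eval(self, value: Any) -> Any:
        if isinstance(value, list):
            # A bag is modeled as a list of tuples: project every tuple.
            return [tuple(t[c] for c in self.columns) for t in value]
        # A single tuple: the result is still a tuple, never a bare datum.
        return tuple(value[c] for c in self.columns)


# Outer row as produced by a cogroup: the group key plus the bag A.
outer = {"group": "k1", "A": [("k1", 1, 2), ("k1", 3, 4)]}

plan1 = Project([1, 2]).eval(Sentinel("A").eval(outer))  # bag of 2-field tuples
plan2 = Sentinel("group").eval(outer)                    # single field
```

In this model only `Sentinel` touches the outer plan, and `Project` never changes its output kind based on the number of columns.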
The following examples show the plans inside LOGenerate:-
Example1
B = FOREACH A GENERATE x1*x2 ;
Sentinel(x1)   Sentinel(x2)
        \       /
           MUL
Example2
FOREACH C GENERATE FLATTEN(A.(f1, f2)), group ;
(plan1)              (plan2)
Sentinel(A)          Sentinel(group)
    |
Project(f1, f2)
Note: Flatten is handled by LOGenerate
Example3
W = LOAD '...' AS (url, outlink);
G = GROUP W by url;
R = FOREACH G {
    FW = FILTER W BY outlink eq 'www.apache.org';
    PW = FW.outlink;
    DW = DISTINCT PW;
    GENERATE group, COUNT(DW);
}
(plan1)              (plan2)
Sentinel(group)      Sentinel(W)
                         |
                       Filter
                         |
                     Project(outlink)
                         |
                      Distinct
                         |
                       COUNT
Thoughts?