PIG ILLUSTRATE Command
The ILLUSTRATE command demonstrates how a few "good" example input records are transformed as they pass through a series of PIG commands, identified by their corresponding aliases. The quality of the "good" data, and consequently the quality of the illustration, is judged by three measures: completeness, conciseness, and degree of realism. The concept and the algorithm are from http://i.stanford.edu/~olston/publications/sigmod09.pdf.
After the base data from a portion of a real input are fetched and all their derived data along the command tree are produced, three basic steps optimize the three measures. First, a pruning pass maximizes conciseness without sacrificing completeness. Next, an upstream synthesis step is executed, on a "best effort" basis, to enhance completeness while maintaining as much realism as possible. Finally, one more pruning pass trims away any redundant records that hurt conciseness after the possible introduction of synthetic records in the second step.
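The prune/synthesize/prune procedure can be sketched as follows. This is an illustrative model only, not Pig's actual Java implementation: records are plain tuples, equivalence classes are lists of records, and the helper names (`prune`, `synthesize`, `make_record`) are assumptions made for the sketch.

```python
def prune(classes, protected=()):
    """First/last pruning pass: keep one representative record per
    equivalence class (completeness), dropping the rest (conciseness).
    Records in `protected` (e.g. synthetic ones) are never dropped."""
    kept = set(protected)
    for cls in classes:
        survivors = [r for r in cls if r in kept]
        if not survivors and cls:
            kept.add(cls[0])  # one representative is enough
    return kept

def synthesize(classes, make_record):
    """Best-effort upstream synthesis: fabricate one record for every
    equivalence class that the real data left empty."""
    added = []
    for i, cls in enumerate(classes):
        if not cls:
            fake = make_record(i)
            cls.append(fake)
            added.append(fake)
    return added

# Two equivalence classes of a FILTER (pass / fail); the real sample
# only covers the "pass" class, so one record is synthesized.
classes = [[("alice", 7), ("bob", 9)], []]
synthetic = synthesize(classes, lambda i: ("synthetic", i))
kept = prune(classes, protected=synthetic)
```

The end result is one real record covering the "pass" class plus one synthetic record covering the otherwise-empty "fail" class: complete yet concise, and as real as the input allows.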
The command is executed in local mode.
Status Quo of the Code on Trunk
The current code on trunk supports LOAD/STORE/COGROUP/JOIN/FILTER/UNION, but is deemed incomplete, obsolete, or of questionable quality in the following areas:
- it is using the older logical plan scheme;
- the optimization is applied in a very limited way and no new optimization rules are used (of course!);
- it does not support LIMIT/SPLIT/DISTINCT/CROSS/STREAM operators;
- MAP type is not supported;
- Inner plans of FOREACH and FILTER are not accounted for when figuring out the completeness/conciseness of the example data, making the completeness calculation only marginally valid;
- Unit test cases are few and far between. Coverage is estimated at less than 5%, full functional tests are lacking, verification is cursory, and no end-to-end tests are present.
- During pruning, when equivalence classes and affinity groups are used on a per-operator basis, their handling is separated from the data generation process, making them difficult to manage since they work on the generated data;
- dead code and unused variables/arguments/classes are scattered around.
Suggested Additional Development Work
Aside from adoption of the new logical plan, code cleanup, test case enhancements, and bug fixes, the major work will be on the functional side. Although the basic algorithm will be adhered to, in some areas it will be enhanced slightly so as to follow the operators' semantics while still honoring the three principles of completeness, conciseness, and realism as much as possible.
Feasibility of Equivalence Classes
One cornerstone of the algorithm is the concept of "equivalence classes", which partition the records an operator processes into classes such that any single record in a class is sufficient to represent the whole class with respect to one aspect of the operator's behavior. This concept places two implicit requirements on the operators: 1) the classification is value-based; 2) the classification of a record is independent of all other records' classifications.
However, not all PIG operators meet these two requirements. Some, such as FILTER, LOAD and UNION, fit nicely. Others, like COGROUP and DISTINCT, need a bit more information, the "affinity level" in the case of COGROUP, to add more constraints during pruning. Still others, like LIMIT, do not fit well, since their classification is not value-based and is sensitive to the classification of past records. For those operators, after the normal pruning-synthesis-pruning steps, a post-processing step will be introduced to restore the criteria according to the operators' semantics.
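The contrast between the two kinds of operators can be made concrete with a small sketch. The function names and the sample predicate are illustrative only; they are not Pig APIs:

```python
def filter_class(record, predicate):
    # FILTER meets both requirements: the class ("pass"/"fail") depends
    # only on the record's value, independently of any other record.
    return "pass" if predicate(record) else "fail"

def limit_class(position, n):
    # LIMIT violates both: the class depends on the record's position
    # in the stream, not its value, so a post-processing step is needed
    # after the usual pruning-synthesis-pruning.
    return "kept" if position < n else "dropped"

records = [3, 8, 1, 9]
filter_classes = [filter_class(r, lambda x: x > 5) for r in records]
limit_classes = [limit_class(i, 2) for i, _ in enumerate(records)]
```

Reordering the input leaves `filter_classes` unchanged (up to the same reordering) but changes which records LIMIT keeps, which is exactly why LIMIT cannot be handled by value-based equivalence classes alone.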
Equivalence Classes on Expressions and Functions
One more hurdle regarding equivalence classes is that they are currently defined only on relational operators; expression operators are not in the picture. But functions and expression operators that produce finite results should have equivalence classes too, to make the completeness coverage check, well, complete. At present, no equivalence classes are defined for so-called "multifaceted" filtering beyond the simple "projection comparison with constant" filtering. Specifically, the comparison operators, the Boolean operators, the SIGN and NULL operators, and the IsEmpty and bincond functions (namely, all the operators and functions that currently generate boolean outputs, intermediate or final; if other operators/functions that generate finite results are introduced in the future, they will have equivalence classes defined too) should have equivalence classes defined.
In contrast, operators that generate continuous results, e.g., the arithmetic operators, won't have equivalence classes defined. One might think, for example, that the division operator, '/', could well define two equivalence classes, with the second holding the inputs that have a zero denominator. However, that should be treated as an exception handling case, and the ILLUSTRATE command should not take exception handling into consideration.
Data Synthesis and Operator/Expression Invertibility
During the data synthesis process, operators should ideally be able to generate input values from given output values. Currently only comparison operators are capable of generating such "reverse" results. This capability will be extended to all expression operators.
However, to support "multifaceted" filtering, the equivalence classes need to cover all combinations of the possible Boolean truth values of the primitive logical expressions. If all the primitive logical expressions are unrelated, it is fine to simply synthesize the constrained input fields separately. Otherwise, the multidimensional space of the constrained input fields could be carved up in arbitrarily complex ways, making the definition of equivalence classes impracticable if not infeasible. For instance, a simple compound logical expression such as "(a > 5) AND (b < 3)" carves the 2-d space of (a, b) into 4 sections, and one pair of synthetic values independently chosen along both axes in each section represents an equivalence class well. A slightly more complex compound logical expression, "((a+b > 5) AND (a-b < 3))", still carves the 2-d space into 4 sections, but it is impossible to independently select the values of a and b to represent each of the 4 sections. Given this complexity and the "best-effort" nature already present in the basic algorithm, we will synthesize data for "multifaceted" filtering only if all the primitive logical expressions of a compound logical expression are independent. All other types of compound logical expressions will use the existing "best-effort" approach on the top-level logical expression. The equivalence classes will still be defined on all possible combinations of the truth values of the primitive logical expressions, so as to avoid unnecessary completeness reduction during pruning.
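For the independent case, the synthesis described above can be sketched directly. The example uses the "(a > 5) AND (b < 3)" expression from the text; the per-field value pickers are illustrative assumptions, not Pig code:

```python
from itertools import product

# Each primitive constrains one field independently of the other, so a
# value making it true or false can be chosen per field in isolation.
primitives = [
    ("a", {True: 6, False: 5}),  # makes "a > 5" true / false
    ("b", {True: 2, False: 3}),  # makes "b < 3" true / false
]

def synthesize_all():
    """One synthetic record per combination of primitive truth values:
    these combinations are exactly the equivalence classes used for
    "multifaceted" filtering."""
    records = []
    for combo in product([True, False], repeat=len(primitives)):
        rec = {field: values[truth]
               for (field, values), truth in zip(primitives, combo)}
        records.append((combo, rec))
    return records

records = synthesize_all()
```

For "((a+b > 5) AND (a-b < 3))" this per-field table cannot be built, because no fixed value of `a` alone makes "a+b > 5" true or false; that is the dependent case the text excludes from synthesis.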
UDFs are currently considered "irreversible". In the future, we may want to support hooks so that UDFs can provide a way to generate the "reverse" results. That would be beneficial for UDF testing and verification purposes.
Given the general need to synthesize data in operators and expressions, and the polymorphic nature of operators and expressions, a new method, getSyntheticInput, will be introduced in the EvalFunc and PhysicalOperator classes. It defaults to a no-op but otherwise works out the input constraints based upon existing data and output constraints. This method will conceivably take the place of many existing methods of the AugmentBaseDataVisitor class that handle the data synthesis process.
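The shape of the proposed hook might look like the following. This is a Python sketch of the idea, not the actual Java signatures planned for EvalFunc/PhysicalOperator; the class names and the comparison example are assumptions:

```python
class PhysicalOperator:
    """Stand-in for Pig's physical operator base class."""

    def get_synthetic_input(self, output_constraint):
        """Default: no-op. An operator that cannot invert its output
        (e.g. a UDF, for now) simply yields no input constraint."""
        return None

class GreaterThan(PhysicalOperator):
    """Comparison against a constant: the one case that is reversible
    today, generalized here to show how the override would work."""

    def __init__(self, rhs_constant):
        self.rhs = rhs_constant

    def get_synthetic_input(self, output_constraint):
        # Pick an input value that makes the desired boolean output
        # come out true (or false).
        return self.rhs + 1 if output_constraint else self.rhs

class SomeUDF(PhysicalOperator):
    pass  # inherits the no-op default: UDFs are irreversible for now

op = GreaterThan(5)
```

Centralizing this on the operator classes is what would let the polymorphic dispatch replace the per-operator methods currently spread across AugmentBaseDataVisitor.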
However, in this version, there is no plan to support synthesis data on any expressions or functions.
For CROSS, each output record generated by inputs that have more than one record is put into a single equivalence class. The purpose is to illustrate at least a 2-by-2 cross product between two inputs.
For LIMIT, in base data generation all rows are kept through the pruning-synthesis-pruning procedure; that is, LIMIT is treated as a no-op. In the post-processing step, one synthetic record is added as the input constraint to demonstrate the effect of the LIMIT operator. Note that by doing this, the original LIMIT value is most likely not respected. As an explanation, a message will be displayed to the effect that the value has been changed for illustration purposes.
The use of the post-processing step takes the following points into consideration:
- the semantics of LIMIT does not guarantee any particular records to be retained;
- An implementation artifact is that the first records are retained, which tends to hurt completeness tremendously. Consequently, the usual pruning-synthesis-pruning steps on the data set after LIMIT would tend to create many synthetic records while many real yet unused "limited out" records remain.
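One way the LIMIT post-process could behave is sketched below. This is a simplified model of the idea (run the pipeline with LIMIT as a no-op, then adjust the illustrated limit so the cut-off is visible and report the change); the function name and the exact adjustment rule are assumptions, not the committed design:

```python
def illustrate_limit(records, original_limit):
    """Post-process for LIMIT: records arrive untouched because LIMIT
    was a no-op during pruning/synthesis; pick an illustrated limit
    that visibly drops at least one record, and report if it differs
    from the user's original value."""
    survivors = list(records)
    shown_limit = min(original_limit, max(len(survivors) - 1, 1))
    message = None
    if shown_limit != original_limit:
        message = ("LIMIT value changed from %d to %d for illustration"
                   % (original_limit, shown_limit))
    return survivors[:shown_limit], message

shown, note = illustrate_limit(["r1", "r2", "r3"], 10)
```

With a LIMIT of 10 over only 3 illustrated records, the sketch shows 2 records and emits the explanatory message, mirroring the "changed value for illustration purposes" behavior described above.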
DISTINCT has two equivalence classes: one with unique records, the other with duplicates. A data structure, a map from the equivalence class of duplicate records to a pair of the equivalence class of unique records and a set of IdentitySets of duplicate records, is maintained to facilitate the transfer of a record from the duplicate equivalence class to the unique one as a result of pruning, and, eventually, to disallow the pruning of the last record(s) in either equivalence class.
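The bookkeeping can be sketched with simple counters in place of the IdentitySet machinery. This is an illustrative simplification of the structure described above, with hypothetical helper names:

```python
from collections import Counter

def classify(records):
    """Split records into DISTINCT's two equivalence classes: values
    that occur once, and values that occur more than once (tracked
    with their multiplicity)."""
    counts = Counter(records)
    unique = [r for r in records if counts[r] == 1]
    dup_counts = {r: c for r, c in counts.items() if c > 1}
    return unique, dup_counts

def prune_duplicate(unique, dup_counts, record):
    """Prune one copy of a duplicated record; when a single copy
    remains, the record migrates into the unique class, so neither
    class can be emptied by pruning alone."""
    dup_counts[record] -= 1
    if dup_counts[record] == 1:
        del dup_counts[record]
        unique.append(record)

unique, dups = classify(["a", "b", "b", "b", "c", "c"])
prune_duplicate(unique, dups, "b")  # "b": 3 copies -> 2
prune_duplicate(unique, dups, "b")  # "b": 2 copies -> 1, migrates
```

The migration step is the point of the map: pruning must know when a formerly duplicated record has become unique, so the "duplicates" class still has at least one genuinely duplicated member left.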
Lineage tracers are set up in the nested block of FOREACH to allow for the coverage of the inner plan.
Results for nested aliases will be displayed the same way as normal operators but with the operator name scoped with the enclosing alias using the '.' as the scoping separator.
MAP dereferences will have a different equivalence class for each distinct dereference. This is to account for the fact that a dereference on a map key is more likely to be null than a bag or tuple dereference.
Implicit SPLIT will have a single equivalence class so that at least one record will be pushed for downstream processing.
For explicit SPLIT, for the same complexity and feasibility reasons as with "multifaceted" filtering, an approach similar to that for composite expressions in FILTER is adopted during the pruning and synthesis processes. That is, in pruning, equivalence classes are created based upon combinations of the possible logical truth values of the children FILTERs' logical expressions; in the upstream synthesis process, however, only the truth values of the top level of the children FILTERs' logical expressions are considered for completeness.
In the above discussion, the children of a SPLIT are those that are traversed by the operations leading to the alias being illustrated, of course.
For upstream data synthesis, SPLIT will union the input constraints derived from all its children's output constraints. This could hurt conciseness, as the input constraints from different children might be mergeable, particularly if their significant fields (as opposed to the "don't care" fields) do not overlap. But that would require delaying the instantiation of the "don't care" fields by the downstream operators and setting criteria for the downstream operators to instantiate those fields. The complexity is judged to be not worth the effort. It might be left for a future release.
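The unioning behavior (and the merging refinement deliberately left out) can be sketched as follows; the constraint representation, with `None` standing for a "don't care" field, is an assumption of the sketch:

```python
def split_input_constraints(children_constraints):
    """SPLIT's upstream synthesis: union the constraints flowing back
    from every traversed child, deduplicating only identical tuples.
    Constraints with non-overlapping significant fields are NOT merged;
    that refinement is left for a future release."""
    seen, merged = set(), []
    for child in children_constraints:
        for constraint in child:
            if constraint not in seen:
                seen.add(constraint)
                merged.append(constraint)
    return merged

# Child 1 constrains only field a (b is "don't care"); child 2
# constrains only field b. They could in principle collapse into one
# record (6, 2), but the simple union keeps them separate.
constraints = split_input_constraints([[(6, None)], [(None, 2)]])
```

Collapsing `(6, None)` and `(None, 2)` into a single record is exactly the consolidation that would require the "richer constraint language" mentioned under future enhancements.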
A new interface, Illustratable, will be introduced to support all ILLUSTRATE-related operations. Each physical relational operator will implement it as a fully functional operation, while each physical expression operator's implementation will throw an exception.
The SAMPLE operator will be an exception in that it will not work under ILLUSTRATE. The reason is that for SAMPLE, realism is of utmost importance, which basically forbids any data synthesis. On the other hand, pruning would defeat the purpose of the sample sizing.
Map-side join/cogroup is not supported in ILLUSTRATE, as these are performance-related rather than data-centric features.
STREAM and UDFs do not support data synthesis now. In the future, after hooks are introduced to allow the synthesis function to run in the external context, both can be enabled.
The schema for illustrated aliases will be displayed alongside the results.
Currently ILLUSTRATE works on only one alias at a time, which is aimed more at interactive use. Support for ILLUSTRATE on a whole script is to be added. For non-aliased leaf nodes like STORE, the whole statement will be used as an identifier for the displayed results.
Command flags similar to those for EXPLAIN might need to be supported in this release.
The existing implementation uses a limited "true local mode" that bypasses most call paths from the physical plan to the map-reduce plan, so a different call path has to be maintained just for ILLUSTRATE purposes.
The new version will use the "real" call paths from the physical plan to the map-reduce plan, as used by all other PIG executions. Just before the M/R jobs would be launched, all physical plans contained in the map-reduce plans are transformed into ILLUSTRATE-specific ones, and a "local runner", in place of the M/R launcher, is fired. The "local runner" runs the pipeline locally through invocations of the "map" and "reduce" methods of the mapper and reducer objects respectively, without launching any M/R jobs. It also takes inputs from, and directs outputs to, in-memory data structures set up by the ILLUSTRATE-specific physical operators.
In the future, the following enhancements might be desired.
Hooks will be provided so that UDFs can tell PIG what equivalence classes they have, what post-processing after the pruning-synthesis-pruning procedure should be performed, and how the "reverse" results can be generated to enhance completeness.
Data Synthesis in "Multifaceted" Filtering beyond the Independent Primitive Logical Expressions
Although a fully general approach might be too hard to conceive, support for compound logical expressions beyond those composed of independent primitive logical expressions might be desired in the future.
Data Synthesis on Expressions/Functions
Data synthesis may be desired on expressions and functions.
Refer to the Operator-specific feature section for STREAM.
In upstream synthesis, input constraints from children might be consolidated instead of unioned in order to boost conciseness. This would require delaying the instantiation of the "don't care" fields by the downstream operators during their synthesis process, and having the downstream operators set up instantiation criteria, a "richer constraint language", for the instantiation step performed later by some upstream operator.