Pig Types Branch Integration
As of 1/12, the types branch has been merged with the trunk; the types branch should no longer be used.
This page tracks the integration of the types branch back into the main branch.
Step 1: Merge patches from main into branch
A merge from trunk to the branch has not been done since February 27, 2008. We need to begin merging patches into the branch. This will be particularly challenging in many cases because large sections of the code have been rewritten. This means that the patches cannot be merged as is, but rather the functional changes will need to be determined and the same changes made in the branch.
As new patches are committed to the trunk we need to track them here to see which category they fall into.
Issues that we believe will be fixed in rework and just need to be tested
JIRA |
Comments |
Done |
pig could not resolve UDF class; maybe already fixed; verify |
Yes |
|
Use of schema on a secondary output of SPLIT throws IndexOutOfBoundsException |
Yes |
|
comparator function not used in local mode; is this already fixed? verify |
Fixed in rework |
|
Incorrect results when there is a dump in between statements; might be fixed |
Already fixed in rework |
|
better error handling; might already have it; need to verify |
Fixed in rework which has error message about cast failures and out of bounds access |
|
On hadoop 0.16, some jobs using combiner fail with an NPE |
Already fixed in rework (the code using RecordReader in the types branch is currently the same as in trunk, so the patch is not applicable to this code) |
|
Combiner error |
This is currently an open issue in trunk (as of 09/15/2008) - script in the issue does run fine in Types branch with small data |
|
Make Grunt's explain output more understandable; Explain is already in the right format |
Yes |
|
pig allows overwriting existing files; addressed in the types branch according to Pi; need to verify. (Per Pi: PIG-237 code was copied from the types branch, so it does not need to be merged) |
Yes |
Issues that do not touch rewritten code and should be easy
JIRA |
Comments |
Done |
remove TokenMgrError and Co from svn properties in src/org/apache/pig/tools/pigscript/parser; involves only changing svn properties |
No |
|
making code release ready: need to add license and build.xml changes |
No |
|
build.xml improvements; might already be in; otherwise trivial to fix |
Yes |
|
running 1 test at a time; might already be there; trivial otherwise |
Yes |
|
target descriptions; might already be in; otherwise a trivial change that can be done now |
Yes |
|
implement version; trivial change; can be applied now |
Yes |
|
parameter substitution; can be moved now |
Yes |
|
pig parser hangs on input script bigger than ~1kb; can be committed right now after (58) |
Yes |
|
similar to 149; can be committed now |
Yes |
|
fix for doc target can be committed now |
Yes |
|
param sub broken with some commands; can be committed now right after (58) |
Yes |
|
incorrect variable definition in parameter substitution; can be committed now after (58) |
Yes |
|
param subst tests hardwire perl location; can be committed now right after (58) |
Yes |
|
support non default constructor with variable number of arguments |
Yes |
|
Calling non default constructor of Final class from Main class in UDF - This is already present in Types branch |
Yes |
|
target for building source jar |
Yes |
|
non-static logger used; |
Yes |
Issues that do touch rewritten code and will be difficult
Streaming
The code is somewhat isolated, so this should not be too bad. Unit tests and end-to-end tests are fairly comprehensive.
JIRA |
Comments |
Done |
Make pig work on Windows - includes streaming fixes |
Fixed with PIG-501 but needs new javacc |
|
problem with streaming optimization |
Yes |
|
Broken pipe if executing the streaming script via the stream command directory |
Yes |
|
Streaming implementation |
Yes |
|
Bug in streaming |
Yes |
|
Failure running complex script with streaming |
Yes |
|
auto ship broken in presence of define for another streaming operator |
Yes |
|
Number of input/output rows in the logs is invalid with BinaryStorage |
Yes |
|
validation of files in ship/cache specs |
Yes |
|
streaming does not handle errors on load function |
Yes |
|
parsing problems for streaming define |
Yes |
|
streaming always returns 127 |
Yes |
|
names for secondary streaming outputs don't match the spec |
Yes |
|
streaming does not handle commands with pipes |
Yes |
|
streaming error |
Yes |
|
Parsing issue on the stream statement? |
Yes |
|
Bug in streaming |
Yes |
|
Bug in streaming |
Yes |
|
Move DEFINE and STORE parsing into QueryParser from GruntParser |
Yes |
Config changes
JIRA |
Comments |
Done |
config changes. The following files were not merged since they are not in types yet: test/org/apache/pig/test/TestPigFile.java; test/org/apache/pig/test/TestStreaming.java; src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java |
Yes* |
|
cleanup after PIG-111 |
Yes |
|
command line properties are ignored; needs manual testing; Pradeep should do this one |
Yes |
Custom splits
JIRA |
Comments |
Done |
streaming fix after custom splits |
Yes |
|
|
Yes |
|
simple changes over slicer |
Yes |
Others
JIRA |
Comments |
Done |
UDF repository creation - needs finalizing the EvalFunc.exec interface and changing DataAtom to String |
Yes |
|
unit tests broken under windows; can be merged now - depends on streaming and other test cases being merged first |
Fixed with PIG-501 but needs new javacc |
|
problems with bzip files; depends on slicer |
No |
|
New illustrate command does not work in mapreduce mode. |
Yes |
|
Illustrate command; I think it should be easy to integrate |
Yes |
|
support hadoop map reduce in local mode |
No |
|
Need wrapper UDFs for all java.lang.Math functions |
Yes |
|
Add tutorial files and builds to Pig SVN |
Yes |
|
Bug in how distinct data bag counts |
Yes |
|
integration with hadoop 0.17. Files not patched: (1) src/org/apache/pig/backend/executionengine/PigSlicer.java - not present yet in types; (2) src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java - changes depend on Config properties change; (3) src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigMapReduce.java - this is present in ../impl/mapReduceLayer now and is different in content; it depends on some other change going in first which will cause PigMapReduce to "implement" MapRunnable and Reducer (in the types branch it currently does not implement any interface); (4) src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigCombine.java - not present in types |
Yes |
|
making PigStorage work with control characters |
The necessary changes are already present in PigStorage in Types branch |
|
memory management; depends on PIG-111; |
Yes |
|
memory management |
Yes |
|
Performance issues with memory spills |
Yes |
|
making Pig work with HOD 0.4; should be doable now - involves backend - HExecutionEngine.java - a little more involved and needs careful merging |
Yes |
|
Optimize Pig by replacing String '+' and StringBuffer with StringBuilder; - there are changes in backend/Operators which have changed in structure now |
Yes |
|
speculative execution is broken; |
Yes |
|
Warnings with hadoop 17 |
Yes |
|
HOD param passing |
Yes |
|
reversible functions; only added the reversible interface; the rest is no longer applicable as the new code does not try to reuse intermediate results |
Yes |
|
NullPointerException thrown if we catch exception with null message |
Yes |
|
purging weak references |
Yes |
|
need to create temp files in the task's working directory; should be easy to integrate |
Yes |
|
UNION/CROSS/JOIN operations should not allow 1 operand; trivial change; |
Yes |
|
escaping |
Yes |
Step 2
Merge from branch to trunk.