Pig Types Branch Integration

As of 1/12, the types branch has been merged with the trunk; the types branch should no longer be used.

This page tracks the integration of the types branch back into the main branch.

Step 1: Merge patches from main into branch

A merge from trunk to the branch has not been done since February 27, 2008. We need to begin merging patches into the branch. This will be particularly challenging in many cases because large sections of the code have been rewritten. This means that the patches cannot be merged as is, but rather the functional changes will need to be determined and the same changes made in the branch.

As new patches are committed to the trunk we need to track them here to see which category they fall into.

Issues that we believe will be fixed in rework and just need to be tested

JIRA

Comments

Done

PIG-183

pig could not resolve UDF class; maybe already fixed; verify

Yes

PIG-178

Use of schema on a secondary output of SPLIT throws IndexOutOfBoundsException

Yes

PIG-202

comparator function not used in local mode; is this already fixed? verify

Fixed in rework

PIG-153

Incorrect results when there is a dump in between statements; might be fixed

Already fixed in rework

PIG-125

better error handling; might already have it; need to verify

Fixed in rework which has error message about cast failures and out of bounds access

PIG-179

On hadoop 0.16, some jobs using combiner fail with an NPE

Already fixed in rework (the code using RecordReader in types branch is same as in trunk as of now. For this code, the patch is not applicable)

PIG-324

Combiner error

This is currently an open issue in trunk (as of 09/15/2008) - script in the issue does run fine in Types branch with small data

PIG-113

Make Grunt's explain output more understandable; Explain is already in the right format

Yes

PIG-237

pig allows overwriting existing files; addressed in the types branch according to Pi; need to verify. (Per Pi: the PIG-237 code was copied from the types branch, so it does not need to be merged)

Yes

Issues that do not touch rewritten code and should be easy

JIRA

Comments

Done

PIG-122

remove TokenMgrError and Co from svn properties in src/org/apache/pig/tools/pigscript/parser; - involves only changing svn properties

No

PIG-34

making code release ready: need to add license and build.xml changes

No

PIG-68

build.xml improvements; might already be in; otherwise trivial to fix

Yes

PIG-124

running 1 test at a time; might already be there; trivial otherwise

Yes

PIG-127

target descriptions; might already be in; otherwise a trivial change that can be done now

Yes

PIG-13

implement version; trivial change; can be applied now

Yes

PIG-58

parameter substitution; can be moved now

Yes

PIG-203

pig parser hangs on input scripts bigger than ~1 KB; can be committed right now after (58)

Yes

PIG-150

similar to 149; can be committed now

Yes

PIG-149

fix for doc target can be committed now

Yes

PIG-218

param sub broken with some commands; can be committed now right after (58)

Yes

PIG-220

incorrect variable definition in parameter substitution; can be committed now after (58)

Yes

PIG-222

param subst tests hardwire perl location; can be committed now right after (58)

Yes

PIG-256

support non default constructor with variable number of arguments

Yes

PIG-255

Calling non default constructor of Final class from Main class in UDF - This is already present in Types branch

Yes

PIG-284

target for building source jar

Yes

PIG-213

non-static logger used;

Yes

Issues that do touch rewritten code and will be difficult

Streaming

The code is somewhat isolated, so it should not be too bad. Unit tests and end-to-end tests are fairly comprehensive.

JIRA

Comments

Done

PIG-243

Make pig work on Windows - includes streaming fixes

Fixed with PIG-501 but needs new javacc

PIG-226

problem with streaming optimization

Yes

PIG-182

Broken pipe if executing the streaming script via the stream command directly

Yes

PIG-94

Streaming implementation

Yes

PIG-180

Bug in streaming

Yes

PIG-272

Failure running complex script with streaming

Yes

PIG-230

auto ship broken in presence of define for another streaming operator

Yes

PIG-232

Number of input/output rows in the logs is invalid with BinaryStorage

Yes

PIG-231

validation of files in ship/cache specs

Yes

PIG-229

streaming does not handle errors on load function

Yes

PIG-227

parsing problems for streaming define

Yes

PIG-224

streaming always returns 127

Yes

PIG-228

names for secondary streaming outputs don't match the spec

Yes

PIG-216

streaming does not handle commands with pipes

Yes

PIG-188

streaming error

Yes

PIG-184

Parsing issue on the stream statement?

Yes

PIG-181

Bug in streaming

Yes

PIG-174

Bug in streaming

Yes

PIG-154

Move DEFINE and STORE parsing into QueryParser from GruntParser

Yes

Config changes

JIRA

Comments

Done

PIG-111

config changes. The following files were not merged since they are not in types yet: test/org/apache/pig/test/TestPigFile.java; test/org/apache/pig/test/TestStreaming.java; src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java

Yes*

PIG-215

cleanup after PIG-111

Yes

PIG-236

command line properties are ignored; needs manual testing; Pradeep should do this one

Yes

Custom splits

JIRA

Comments

Done

PIG-204

streaming fix after custom splits

Yes

PIG-55

Yes

PIG-300

simple changes over slicer

Yes

Others

JIRA

Comments

Done

PIG-246

UDF repository creation - needs finalizing the EvalFunc.exec interface and changing DataAtom to String

Yes

PIG-156

unit tests broken under Windows; can be merged now - depends on streaming and other test cases being merged first

Fixed with PIG-501 but needs new javacc

PIG-151

problems with bzip files; depends on slicer

No

PIG-207

New illustrate command does not work in mapreduce mode.

Yes

PIG-59

Illustrate command; I think should be easy to integrate

Yes

PIG-120

support hadoop map reduce in local mode

No

PIG-245

Need wrapper UDFs for all java.lang.Math functions

Yes

PIG-271

Add tutorial files and builds to Pig SVN

Yes

PIG-342

Bug in how distinct data bag counts

Yes

PIG-198

integration with hadoop 0.17. Files not patched: src/org/apache/pig/backend/executionengine/PigSlicer.java (not present yet in types); src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java (changes depend on Config properties change); src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigMapReduce.java (this is present in ../impl/mapReduceLayer now and is different in content; depends on some other change going in first which will cause PigMapReduce to "implement" MapRunnable and Reducer; in the types branch it currently does not implement any interface); src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigCombine.java (not present in types)

Yes

PIG-85

making PigStorage work with control characters

The necessary changes are already present in PigStorage in Types branch

PIG-176

memory management; depends on PIG-111;

Yes

PIG-170

memory management

Yes

PIG-235

Performance issues with memory spills

Yes

PIG-18

making Pig work with HOD 0.4; should be doable now - involves backend - HExecutionEngine.java - a little more involved and needs careful merging

Yes

PIG-106

Optimize Pig by replacing String '+' and StringBuffer with StringBuilder; - there are changes in backend/Operators which have changed in structure now

Yes

PIG-250

speculative execution is broken;

Yes

PIG-266

Warnings with hadoop 0.17

Yes

PIG-291

HOD param passing

Yes

PIG-114

reversible functions; only added the reversible interface; the rest is no longer applicable as the new code does not try to reuse intermediate results

Yes

PIG-172

NullPointerException thrown if we catch exception with null message

Yes

PIG-164

purging weak references

Yes

PIG-129

need to create temp files in the task's working directory; should be easy to integrate

Yes

PIG-118

UNION/CROSS/JOIN operations should not allow 1 operand; trivial change;

Yes

PIG-123

escaping

Yes

Step 2

Merge from branch to trunk.