Google Summer of Code 2011 for Pig

Pig is exciting! Pig provides an intuitive way to program Hadoop. Inside Yahoo!, more than 80% of Hadoop jobs are Pig jobs. Pig is also heavily used at Twitter, LinkedIn, and many other organizations (http://wiki.apache.org/pig/PoweredBy).

"When I say Hadoop, I really mean Pig" -- Milind Bhandarkar from LinkedIn

Last year, Pig participated in Google Summer of Code for the first time. We had one student (Gianmarco), who worked on raw comparators for secondary sort. The project was very successful, and we adopted his code into our code base.

This year, we will participate again. We have picked a list of highly desired projects for students. All of these projects are doable within the scope of the GSoC program. Once you are accepted, we will assign a dedicated mentor to guide you through the different stages of the program. We need your help, and you will gain great experience participating in an open source project.

Project List

Nested foreach statement (https://issues.apache.org/jira/browse/PIG-1631)

Pig supports DISTINCT, FILTER, LIMIT, and ORDER BY inside a nested foreach statement (http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#FOREACH, http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#nestedblock). However, a nested FOREACH is not yet supported and is highly desired. For example, we need to write queries like:

sessionByUser = group session by user;
b = foreach sessionByUser {
    b1 = foreach session generate accumulateSession(group, session);
    generate group, b1;
}

Though some of this functionality can be achieved by other approaches (Accumulator, UDF with bag, query rewrite, etc.), nested foreach offers simplicity and additional opportunities for optimization (optimization is not part of this project).

Nested cross statement (https://issues.apache.org/jira/browse/PIG-1916)

Similar to nested foreach, we want nested cross as well. One typical use case for nested cross is that, after cogrouping two relations, we want to flatten the records with the same key and do some processing. This is naturally achieved by cross. E.g.:

C = cogroup user by uid, session by uid;
D = foreach C {
    crossed = cross user, session; -- To flatten two input bags
    filtered = filter crossed by user::region == session::region;
    result = foreach filtered generate processSession(user::age, user::gender, session::ip);  -- Nested foreach Jira: PIG-1631
    generate result;
}

Without cross, the user has to write a UDF that processes the user and session bags. That is much harder than writing a UDF that processes flattened tuples.

Compared with nested foreach, the cross operator is easier to implement. However, it is the only nested operator that takes multiple inputs. This adds some complexity, because some existing code assumes a single input for the foreach nested plan.
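The per-key semantics of the nested cross example can be sketched outside Pig. The Python below is only an illustration, assuming each bag is a list of dicts; the function name nested_cross and the sample field values are hypothetical and are not part of Pig.

```python
from itertools import product


def nested_cross(user_bag, session_bag):
    """Pair every user tuple with every session tuple for one cogroup key."""
    return [(u, s) for u, s in product(user_bag, session_bag)]


# Hypothetical bags for a single uid, as they would look after cogroup.
users = [{"uid": 1, "region": "us", "age": 30}]
sessions = [
    {"uid": 1, "region": "us", "ip": "10.0.0.1"},
    {"uid": 1, "region": "eu", "ip": "10.0.0.2"},
]

# cross flattens the two bags; the filter then sees one flat pair at a time.
crossed = nested_cross(users, sessions)
filtered = [(u, s) for u, s in crossed if u["region"] == s["region"]]
```

After the cross, each downstream step handles a single flattened pair instead of two whole bags, which is what makes the per-tuple UDF simpler to write.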

Syntax sugar

We'd like to add several pieces of syntactic sugar (a student may pick 2-3 from the list):

The split statement would benefit from a default destination, e.g.: SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- OTHER has all tuples with f1>=7 && f2!=5 && f3==6
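The intended routing for a default destination can be sketched in Python. This is an illustration of the proposed semantics, not Pig code: the function name split_with_otherwise and the sample rows are hypothetical. It assumes, as in Pig's existing SPLIT, that a tuple may land in more than one output relation.

```python
def split_with_otherwise(rows, branches):
    """Route each row to every branch whose predicate holds.

    Rows that match no branch go to the 'otherwise' output.
    """
    out = {name: [] for name, _ in branches}
    out["otherwise"] = []
    for row in rows:
        matched = False
        for name, pred in branches:
            if pred(row):
                out[name].append(row)
                matched = True
        if not matched:
            out["otherwise"].append(row)
    return out


# Rows are (f1, f2, f3) tuples; the predicates mirror the SPLIT example.
branches = [
    ("X", lambda r: r[0] < 7),
    ("Y", lambda r: r[1] == 5),
    ("Z", lambda r: r[2] < 6 or r[2] > 6),
]
rows = [(1, 2, 3), (8, 5, 6), (8, 4, 6), (9, 5, 7)]
result = split_with_otherwise(rows, branches)
```

Only (8, 4, 6) fails all three conditions, so it is the only row routed to the default output.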

Pig has the TOMAP, TOTUPLE, and TOBAG UDFs. However, it would be much easier if we added syntax support for them:

b = foreach a generate [a0#b0] as m;
b = foreach a generate (a0, a1) as t1;
b = foreach a generate {(a0)} as b1;  -- b1 is a single tuple bag

Currently, LIMIT and SAMPLE only take a constant. It would be better if we could use a scalar in place of the constant. E.g.:

a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
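What the query above is meant to compute can be sketched in Python: the limit is derived from the data (the row count) rather than written as a constant. The function name limit_by_fraction is hypothetical, and integer division is an assumption about how c.sum/100 would be evaluated.

```python
def limit_by_fraction(rows, divisor=100):
    """Return the first len(rows) // divisor rows in sorted order."""
    total = len(rows)        # plays the role of the scalar c.sum
    ordered = sorted(rows)   # plays the role of `order a by $0`
    return ordered[: total // divisor]


# 1000 rows in descending order; the smallest 1% should be kept.
top = limit_by_fraction(list(range(1000, 0, -1)))
```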

Currently, the sample statement only supports simple random sampling. It would be better to support more methods (stratified sampling, bootstrap sampling, etc.).
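As one possible reading of stratified sampling, the Python sketch below draws the same fraction of rows from each stratum. It is not a proposed Pig implementation; the function name stratified_sample, its parameters, and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction of rows from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = int(len(group) * fraction)  # rows to draw from this stratum
        sample.extend(rng.sample(group, k))
    return sample


# Two strata of very different sizes, keyed by the first field.
rows = [("us", i) for i in range(100)] + [("eu", i) for i in range(20)]
picked = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)
```

Unlike simple random sampling over the whole relation, every stratum contributes rows in proportion to its size, so small groups are not lost by chance.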

Different strategies for large and small order bys (https://issues.apache.org/jira/browse/PIG-483)

Currently, Pig always does a multi-pass order by: a first pass determines the distribution of the keys, and a second pass does the ordering. This avoids the need for a single reducer. However, when the data is small enough to fit into a single reducer, this is inefficient. For small data sets, Pig should recognize that the set is small and do the order by in a single pass with a single reducer.
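The choice between the two strategies can be sketched in Python. This is an illustration only and does not reflect Pig's actual partitioner: the function name order_by, the row-count threshold, and the sample size are all assumptions made for the example.

```python
import bisect
import random


def order_by(rows, num_reducers=4, small_threshold=1000, sample_size=100, seed=0):
    """Sort rows, picking a strategy based on input size.

    Returns (sorted_rows, strategy_name).
    """
    if len(rows) <= small_threshold:
        # Small input: one pass, one "reducer".
        return sorted(rows), "single-pass"

    # Pass 1: sample the keys to estimate their distribution.
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    boundaries = [
        sample[(i * len(sample)) // num_reducers] for i in range(1, num_reducers)
    ]

    # Pass 2: range-partition by the boundaries, then sort each partition.
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row)].append(row)

    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result, "two-pass"


small_sorted, small_strategy = order_by([5, 3, 9, 1])

big = list(range(5000))
random.Random(1).shuffle(big)
big_sorted, big_strategy = order_by(big)
```

Because the partitions cover disjoint, increasing key ranges, concatenating the sorted partitions gives a globally sorted result. The small-input branch skips the sampling pass entirely.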

New DataType: Boolean (https://issues.apache.org/jira/browse/PIG-1429)

Pig does not support a boolean data type yet. The only exception is that a user can define a UDF of boolean type. However, in follow-up processing, the user might encounter errors. This project includes:

Other Project Ideas

You can also propose a new project that is not listed. Please discuss it with us before applying.

Getting started

First, you need to learn the Pig Latin language. The best source for learning Pig Latin is:

Be sure to sign up for the Pig mailing list.

Then check out the Pig source code using svn: svn co http://svn.apache.org/repos/asf/pig/trunk

Set up environment for Eclipse: http://wiki.apache.org/pig/Eclipse_Environment

Learn more about Pig internals from the Pig paper published at VLDB 2009.

Browse through the Pig code. Some good starting points are:

How to Apply

GSoc2011 (last edited 2011-03-22 23:42:56 by daijy)