THIS PAGE IS OBSOLETE. Please use the Pig Latin documentation at http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html

Note: For Pig 0.2.0 or later, some content on this page may no longer be applicable.

So you want to learn Pig Latin. Welcome! Let's begin with the data types.

Data Types

Every piece of data in Pig has one of these four types: data atom, tuple, data bag, or data map.

Data Items

Data can be referred to in various powerful and convenient ways in Pig. Any data referred to is called a Data Item. We will illustrate all these ways by using the following example tuple.

t = < 1, {<2,3>,<4,6>,<5,7>}, ['apache':'search']>

Thus, t has 3 fields. Let these fields have names f1, f2, f3. Field f1 is an atom with value 1. Field f2 is a bag having 3 tuples. Field f3 is a data map having 1 key.

The following table lists the various methods of referring to data.

| Method of Referring to Data | Example | Value for example tuple t | Notes |
|---|---|---|---|
| Constant | '1.0', or 'apache.org', or 'blah' | Value constant irrespective of t | |
| Field referred to by position | $0 | Data Atom '1' | In Pig, positions start at 0 and not 1 |
| Field referred to by name | f2 | Bag {<2,3>,<4,6>,<5,7>} | |
| Projection of another data item | f2.$0 | Bag {<2>,<4>,<5>} - the bag f2 projected to the first field | |
| Map Lookup against another data item | f3#'apache' | Data Atom 'search' | It is the user's responsibility to ensure that a lookup is written only against a data map; otherwise a runtime error is thrown. If the key being looked up does not exist, a Data Atom with an empty string is returned. |
| Function applied to another data item | SUM(f2.$0) | 2+4+5 = 11 | SUM is a builtin Pig function. See PigFunctions for how to write your own functions |
| Infix Expression of other data items | COUNT(f2) + f1 / '2.0' | 3 + 1 / 2.0 = 3.5 | |
| Bincond, i.e., the value of the data item is chosen according to some condition | (f1 == '1' ? '2' : COUNT(f2)) | '2' since f1=='1' is true. If f1 were != '1', then the value of this data item for t would be COUNT(f2)=3 | See Conditions for what the format of the condition in the bincond can be |
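
To make the ways of referring to data concrete, here is a minimal Python sketch (not Pig code) that models the example tuple t and evaluates each kind of data item. It assumes atoms can be represented as Python numbers and strings, the bag as a list of tuples, and the data map as a dict; all variable names are illustrative only.

```python
# Hypothetical Python model of the example tuple t; this is not Pig code.
# The bag f2 is modeled as a list of tuples and the data map f3 as a dict.
t = (1, [(2, 3), (4, 6), (5, 7)], {"apache": "search"})
f1, f2, f3 = t  # field names f1, f2, f3 as in the text

by_position = t[0]                       # $0
projection = [(row[0],) for row in f2]   # f2.$0
lookup = f3.get("apache", "")            # f3#'apache' (empty string if key is missing)
total = sum(row[0] for row in f2)        # SUM(f2.$0)
infix = len(f2) + f1 / 2.0               # COUNT(f2) + f1 / '2.0'
bincond = "2" if f1 == 1 else len(f2)    # (f1 == '1' ? '2' : COUNT(f2))

print(by_position, projection, lookup, total, infix, bincond)
```

The values agree with the ones listed for t: the projection keeps a bag of one-field tuples, and the infix expression evaluates the division before the addition.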

Pig Latin Statements

A Pig Latin statement is a command that produces a Relation. A relation is simply a data bag with a name. That name is called the relation's alias. The simplest Pig Latin statement is LOAD, which reads a relation from a file in the file system. Other Pig Latin statements process one or more input relations, and produce a new relation as a result.

Starting with the Pig 1.2 release (due on 09/30/07), Pig commands can span multiple lines and must end with ";".

Examples:

grunt> A = load 'mydata' using PigStorage()
       as (a, b, c);
grunt> B = group A by a;
grunt> C = foreach B {
       D = distinct A.b;
       generate flatten(group), COUNT(D);
       }
grunt>

LOAD: Loading data from a file

Before you can do any processing, you first need to load the data. This is done by the LOAD statement. Suppose we have a tab-delimited file called "myfile.txt" that contains a relation, whose contents are:

1    2    3
4    2    1
8    3    4
4    3    3
7    2    5
8    4    3

Suppose we want to refer to the 3 fields as f1, f2, and f3. We can load this relation using the following command:

A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);

Here, PigStorage is the name of a "storage function" that takes care of parsing the file into a Pig relation. This storage function expects simple newline-separated records with delimiter-separated fields; it has one parameter, namely the field delimiter(s).

Future Pig Latin commands can refer to the alias "A" and will receive data that has been loaded from "myfile.txt". A will contain this data:

<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
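
As a rough illustration of what a delimiter-based storage function does, the following Python sketch splits newline-separated records into tuples of fields. The function name load_tab_delimited is hypothetical and this is not how PigStorage is implemented; it only mirrors the behavior described above.

```python
def load_tab_delimited(text, delimiter="\t"):
    """Split newline-separated records into tuples of string fields."""
    return [tuple(line.split(delimiter)) for line in text.splitlines() if line]

# The contents of myfile.txt from the text, written with explicit tabs.
sample = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"

A = load_tab_delimited(sample)
for record in A:
    print(record)
```

Each field comes back as a string here, so any numeric interpretation is left to later processing in this sketch.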

Notes:

FILTER: Getting rid of data you are not interested in

Very often, the first thing that you want to do with data is to get rid of tuples that you are not interested in. This can be done with the FILTER statement. For example,

Y = FILTER A BY f1 == '8';

The result is Y =

<8, 3, 4>
<8, 4, 3>

Specifying Conditions

The condition following the keyword BY can be much more general than the one shown above.

Thus, a somewhat more complicated condition can be

Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
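
The two filter conditions above can be mimicked in Python as list comprehensions over the relation A. This is only a sketch of the semantics, assuming the fields are compared as numbers; it is not Pig code.

```python
# Relation A from the LOAD section, modeled as a list of (f1, f2, f3) tuples.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Y = FILTER A BY f1 == '8';
simple = [(f1, f2, f3) for (f1, f2, f3) in A if f1 == 8]

# Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
compound = [(f1, f2, f3) for (f1, f2, f3) in A
            if f1 == 8 or not (f2 + f3 > f1)]

print(simple)
print(compound)
```

Under the numeric assumption, the compound condition additionally keeps <4, 2, 1> and <7, 2, 5>, because for those tuples f2+f3 does not exceed f1.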

Note:

COGROUP: Getting the relevant data together

We can group the tuples in A according to some specification. A simple specification is to group according to the value of one of the fields, e.g. the first field. The first statement below does this; the second shows the general form for grouping on several fields:

X = GROUP A BY f1;
X = GROUP A BY (f1, f2 ..);

The result of the group statement consists of one tuple for each group. The first field of the tuple has name group and has the value on which the grouping has been performed, and the second field has name A and is a bag containing the tuples belonging to that group. Thus, for X = GROUP A BY f1, the result is X =

<1, {<1, 2, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}>
<7, {<7, 2, 5>}>
<8, {<8, 3, 4>, <8, 4, 3>}>
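
The shape of a GROUP result can be sketched in Python as follows. The helper group_by is hypothetical, and sorting the groups by key is a choice made here only to get a deterministic listing; it is not a statement about the order Pig produces.

```python
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def group_by(relation, key):
    """Return one (group, bag) pair per distinct key value."""
    groups = defaultdict(list)
    for tup in relation:
        groups[key(tup)].append(tup)
    return [(k, groups[k]) for k in sorted(groups)]

# X = GROUP A BY f1;
X = group_by(A, key=lambda tup: tup[0])
for group, bag in X:
    print(group, bag)
```

Passing a different key function, for example lambda tup: (tup[0], tup[1]), models grouping on several fields in the same way.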

Suppose we have a second relation B =

<2, 4>
<8, 9>
<1, 3>
<2, 7>
<2, 9>
<4, 6>
<4, 9>

We can co-group A and B, which means that we jointly group the tuples from A and B, using this command:

COGROUP A BY f1, B BY $0;

You can co-group by multiple columns the same way as for group.

The result is:

<1, {<1, 2, 3>}, {<1, 3>}>
<2, {}, {<2, 4>, <2, 7>, <2, 9>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>,<4, 9>}>
<7, {<7, 2, 5>}, {}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

Now, all of the original tuples whose first field is 1 are grouped together, the original tuples whose first field is 2 are together, and so on. Thus, similar to a group, the result of a co-group has one tuple for each group. The first field is called group as before and contains the value on which grouping has been performed. In addition, every tuple has a bag for each relation being co-grouped (having the same name as the alias for that relation) that contains the tuples of that relation belonging to that group.

Note that some of the bags are empty, which indicates that no tuples from the corresponding input belong to that group. If we only wish to see groups for which both inputs have at least one tuple, we can write:

C = COGROUP A BY $0 INNER, B BY $0 INNER;

The result is C =

<1, {<1, 2, 3>}, {<1, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}, {<4, 6>, <4, 9>}>
<8, {<8, 3, 4>, <8, 4, 3>}, {<8, 9>}>

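The INNER keyword can be used asymmetrically: if it is applied to only one input, groups in which that input's bag is empty are dropped, while groups in which the other input's bag is empty are kept.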
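
The effect of INNER on each input can be sketched in Python. The function cogroup and its a_inner and b_inner flags are hypothetical names for this illustration, and the groups are sorted by key only to make the output deterministic.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def cogroup(a, b, a_inner=False, b_inner=False):
    """Co-group two relations on their first field.

    A group is dropped when an input marked as inner has an empty bag.
    """
    keys = sorted({t[0] for t in a} | {t[0] for t in b})
    result = []
    for k in keys:
        bag_a = [t for t in a if t[0] == k]
        bag_b = [t for t in b if t[0] == k]
        if (a_inner and not bag_a) or (b_inner and not bag_b):
            continue
        result.append((k, bag_a, bag_b))
    return result

print([g for g, _, _ in cogroup(A, B)])                              # no INNER
print([g for g, _, _ in cogroup(A, B, a_inner=True, b_inner=True)])  # both INNER
print([g for g, _, _ in cogroup(A, B, a_inner=True)])                # INNER on A only
```

With INNER on A only, group 2 disappears because A contributes no tuples to it, while group 7 survives even though its B bag is empty.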

In addition to using columns to group the data, an arbitrary expression can be used:

grunt> cat a        
r1      1       2
r2      2       1
r3      2       8
r4      4       4
grunt> a = load 'a';
grunt> b = group a by $1*$2;
grunt> dump b;

------ MapReduce Job -----
Input: [/user/utkarsh/a:org.apache.pig.builtin.PigStorage()]
Map: [[*]]
Group: [GENERATE {[org.apache.pig.impl.builtin.MULTIPLY(GENERATE {[PROJECT $1],[PROJECT $2]})],[*]}]
Combine: null
Reduce: null
Output: /tmp/temp1762405695/tmp1820603819:org.apache.pig.builtin.BinStorage
Split: null
Map parallelism: -1
Reduce parallelism: -1
Job jar size = 399671
Pig progress = 0%
Pig progress = 50%
Pig progress = 100%
(2.0, {(r1, 1, 2), (r2, 2, 1)})
(16.0, {(r3, 2, 8), (r4, 4, 4)})
grunt> 

Note:

FOREACH ... GENERATE: Applying transformations to the data

The FOREACH statement is used to apply transformations to the data and to generate new data items. The basic syntax is

<output-alias> = FOREACH <input-alias> GENERATE <data-item 1>, <data-item 2>, ... ;

For each tuple in the input alias, the data items are evaluated, and a tuple containing these data items is put in the output alias. We explain this statement in greater detail by giving examples of typical uses.

Projection

To select a subset of columns from a relation, use this command:

X = FOREACH A GENERATE f1, f2;

X contains tuples from A, but with only the first and second fields present in each tuple. For the value of A given above, X =

<1, 2>
<4, 2>
<8, 3>
<4, 3>
<7, 2>
<8, 4>

Projection elements can be given names using the AS <alias> construct. This lets later statements refer to the fields of the produced expression by name:

X = FOREACH A GENERATE f1+f2 as sumf1f2;
Y = FILTER X by sumf1f2 > '5';

As with SQL, asterisk (*) is shorthand for all columns. For example, with:

X = FOREACH A GENERATE *;

X is identical to A.

Nested projection

If one of the fields in the input relation is a non-atomic field, we can perform projection on that field. For example,

FOREACH C GENERATE group, B.$1;

The result is:

<1, {<3>}>
<4, {<6>, <9>}>
<8, {<9>}>

Here is another example, in which multiple nested columns are retained:

FOREACH C GENERATE group, A.(f1, f2);

The result is:

<1, {<1, 2>}>
<4, {<4, 2>, <4, 3>}>
<8, {<8, 3>, <8, 4>}>

Applying functions

Pig has a number of built-in functions. An example is the SUM() function, which takes the sum of a set of numbers in a bag. For example:

FOREACH C GENERATE group, SUM(A.f1);

gives:

<1, 1>
<4, 8>
<8, 16>
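
The nested projections and the SUM example can be modeled in Python over the co-grouped relation C. This is a sketch of the semantics only; the variable names are illustrative and bags are represented as lists of tuples.

```python
# C = COGROUP A BY $0 INNER, B BY $0 INNER; modeled as (group, bag A, bag B).
C = [
    (1, [(1, 2, 3)], [(1, 3)]),
    (4, [(4, 2, 1), (4, 3, 3)], [(4, 6), (4, 9)]),
    (8, [(8, 3, 4), (8, 4, 3)], [(8, 9)]),
]

# FOREACH C GENERATE group, B.$1;
proj_b = [(g, [(t[1],) for t in bag_b]) for g, bag_a, bag_b in C]

# FOREACH C GENERATE group, A.(f1, f2);
proj_a = [(g, [(t[0], t[1]) for t in bag_a]) for g, bag_a, bag_b in C]

# FOREACH C GENERATE group, SUM(A.f1);
sums = [(g, sum(t[0] for t in bag_a)) for g, bag_a, bag_b in C]

print(proj_b)
print(proj_a)
print(sums)
```

Note that a projection of a bag is still a bag of tuples, whereas SUM collapses the projected bag into a single value per group.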

You may also register your own function with Pig, and refer to it in Pig Latin commands. See PigFunctions.

Note: In Pig, all functions, e.g., COUNT() and SUM(), are case-sensitive (this is true for built-in functions as well as user-supplied functions).

Flattening

Sometimes we want to eliminate nesting. This can be accomplished via the FLATTEN keyword, which can be applied to any valid data item. For example:

FOREACH C GENERATE group, FLATTEN(A);

yields:

<1, 1, 2, 3>
<4, 4, 2, 1>
<4, 4, 3, 3>
<8, 8, 3, 4>
<8, 8, 4, 3>

As another example,

FOREACH C GENERATE group, FLATTEN(A.f3);

yields:

<1, 3>
<4, 1>
<4, 3>
<8, 4>
<8, 3>

As a final example,

FOREACH C GENERATE flatten(A.(f1, f2)), flatten(B.$1);

yields:

<1, 2, 3>
<4, 2, 6>
<4, 3, 6>
<4, 2, 9>
<4, 3, 9>
<8, 3, 9>
<8, 4, 9>

Note that for the group '4' in C, there were 2 tuples each in the bags A and B. Thus, when both the bags are flattened, the cross product of these tuples is returned, i.e., the tuples <4, 2, 6>, <4, 3, 6>, <4, 2, 9>, and <4, 3, 9> in the result.
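
The cross product produced by flattening two bags in the same GENERATE can be sketched in Python with nested loops. The function name flatten_both is hypothetical, and the loop order was chosen only so that the output lines up with the listing above.

```python
# C modeled as (group, bag A, bag B), as produced by the inner co-group.
C = [
    (1, [(1, 2, 3)], [(1, 3)]),
    (4, [(4, 2, 1), (4, 3, 3)], [(4, 6), (4, 9)]),
    (8, [(8, 3, 4), (8, 4, 3)], [(8, 9)]),
]

def flatten_both(relation):
    """Model FOREACH C GENERATE flatten(A.(f1, f2)), flatten(B.$1)."""
    out = []
    for _group, bag_a, bag_b in relation:
        left = [(t[0], t[1]) for t in bag_a]   # A.(f1, f2)
        right = [(t[1],) for t in bag_b]       # B.$1
        # Every tuple of one bag is paired with every tuple of the other.
        for r in right:
            for l in left:
                out.append(l + r)
    return out

for row in flatten_both(C):
    print(row)
```

Group 4 contributes 2 x 2 = 4 output tuples, while groups 1 and 8 contribute 1 and 2 tuples respectively.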

Joining

The equi-join of A and B on column 0 can be expressed as follows:

JOIN A BY $0, B BY $0;

which is equivalent to:

X = COGROUP A BY $0 INNER, B BY $0 INNER;
FOREACH X GENERATE FLATTEN(A), FLATTEN(B);

The result is:

<1, 2, 3, 1, 3>
<4, 2, 1, 4, 6>
<4, 3, 3, 4, 6>
<4, 2, 1, 4, 9>
<4, 3, 3, 4, 9>
<8, 3, 4, 8, 9>
<8, 4, 3, 8, 9>
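
The equivalence between JOIN and an inner COGROUP followed by two FLATTENs can be checked with a small Python model. The function equi_join is a hypothetical illustration of that rewrite, not Pig's implementation; keys are visited in sorted order for a deterministic result.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def equi_join(a, b):
    """Join on field 0: inner co-group, then flatten both bags."""
    out = []
    # INNER on both inputs keeps only the keys present in both relations.
    for key in sorted({t[0] for t in a} & {t[0] for t in b}):
        bag_a = [t for t in a if t[0] == key]
        bag_b = [t for t in b if t[0] == key]
        # Flattening both bags yields their cross product within the group.
        for tb in bag_b:
            for ta in bag_a:
                out.append(ta + tb)
    return out

for row in equi_join(A, B):
    print(row)
```

Keys 2 and 7 appear in only one input, so they produce no joined tuples.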

Note: On flattening, we might end with fields that have the same name but which came from different tables. They are disambiguated by prepending <alias>:: to their names. See PigLatinSchemas.

ORDER: Sorting data according to some fields

We can sort the contents of any alias according to any set of columns. For example,

X = ORDER A BY $2;

One possible output (since ties are resolved arbitrarily) is X =

<4, 2, 1>
<1, 2, 3>
<4, 3, 3>
<8, 4, 3>
<8, 3, 4>
<7, 2, 5>
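
Sorting on the third field can be modeled in Python with sorted and a key function. Python's sort is stable, so tuples that tie keep their input order; that is just one of the possible outputs, since the text says ties are resolved arbitrarily in Pig.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# X = ORDER A BY $2;
X = sorted(A, key=lambda tup: tup[2])

for row in X:
    print(row)
```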

Notes:

DISTINCT: Eliminating duplicates in data

We can eliminate duplicates in the contents of any alias. For example, suppose we first say

X = FOREACH A GENERATE $2;

As we know, this will result in X =

<3>
<1>
<4>
<3>
<5>
<3>

Now, if we say

Y = DISTINCT X;

The output is Y =

<1>
<3>
<4>
<5>
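
In Python terms, the projection followed by DISTINCT behaves like building a set of one-field tuples. The sketch below sorts the set only to print it in a fixed order; it makes no claim about the order in which Pig emits distinct tuples.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# X = FOREACH A GENERATE $2;
X = [(tup[2],) for tup in A]

# Y = DISTINCT X;
Y = sorted(set(X))

print(X)
print(Y)
```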

Notes:

STREAM: Using Custom Code with Pig

This is a recent addition to the language. It lets you add custom processing to Pig's execution pipeline. The details can be found in PigStreamingFunctionalSpec.

CROSS: Computing the cross product of multiple relations

To compute the cross product (also known as the "Cartesian product") of two or more relations, use:

X = CROSS A, B;

Based on the values of A and B given earlier in the document, the result is X =

<1, 2, 3, 2, 4>
<1, 2, 3, 8, 9>
<1, 2, 3, 1, 3>
<1, 2, 3, 2, 7>
<1, 2, 3, 2, 9>
<1, 2, 3, 4, 6>
<1, 2, 3, 4, 9>
<4, 2, 1, 2, 4>
<4, 2, 1, 8, 9>
...
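
The full cross product can be reproduced in Python with itertools.product, which pairs every tuple of A with every tuple of B. This is a model of the result only, not of how Pig computes it.

```python
from itertools import product

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# X = CROSS A, B;
X = [ta + tb for ta, tb in product(A, B)]

print(len(X))
for row in X[:9]:
    print(row)
```

With 6 tuples in A and 7 in B the result has 42 tuples, which is why the listing above is truncated.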

Notes:

UNION: Computing the union of multiple relations

We can vertically glue together contents of multiple aliases into a single alias by the UNION command. For example,

X = UNION A, B;

The result is X =

<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
<2, 4>
<8, 9>
<1, 3>
<2, 7>
<2, 9>
<4, 6>
<4, 9>

Notes:

SPLIT: Separating data into different relations

The SPLIT statement, in some sense, is the converse of the UNION statement. It is used to partition the contents of a relation into multiple relations based on desired conditions.

An example of a SPLIT statement is the following,

SPLIT A INTO X IF $0 < 7, Y IF ($0 > 2 AND $0<> 7);

The output is

X = 
<1, 2, 3>
<4, 2, 1>
<4, 3, 3>

and 

Y = 
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<8, 4, 3>
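
The SPLIT example can be modeled in Python as two independent filters over A, one per output relation. This sketch assumes numeric comparison of the first field.

```python
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# SPLIT A INTO X IF $0 < 7, Y IF ($0 > 2 AND $0 <> 7);
X = [tup for tup in A if tup[0] < 7]
Y = [tup for tup in A if tup[0] > 2 and tup[0] != 7]

print(X)
print(Y)
```

Because each condition is evaluated independently, the tuples with first field 4 land in both X and Y, while <7, 2, 5> satisfies neither condition and appears in no output.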

Notes:

Nested Operations in FOREACH...GENERATE

If one of the fields in the input relation is a data bag, the nested data bag can be treated as an inner or a nested relation. Consequently, in a FOREACH...GENERATE statement, we can perform many of the operations on this nested relation that we can on a regular relation.

The specific operations that we can do on the nested relations are FILTER, ORDER, and DISTINCT. Note that we do not allow FOREACH...GENERATE on the nested relation, since that leads to the possibility of an arbitrary number of nesting levels.

The syntax for doing the nested operations is very similar to the regular syntax and is demonstrated by the following example:

W = LOAD '...' AS (url, outlink);
G = GROUP W by url;
R = FOREACH G {
        FW = FILTER W BY outlink eq 'www.apache.org';
        PW = FW.outlink;
        DW = DISTINCT PW;
        GENERATE group, COUNT(DW);
}
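
The nested block above can be traced step by step in Python. The input rows in W are invented purely for illustration (the original elides the file name and its contents), and the intermediate names mirror the aliases FW, PW and DW.

```python
from collections import defaultdict

# Hypothetical (url, outlink) rows; the real input is not given in the text.
W = [
    ("a.com", "www.apache.org"),
    ("a.com", "www.apache.org"),
    ("a.com", "www.example.com"),
    ("b.com", "www.example.com"),
]

# G = GROUP W by url;
groups = defaultdict(list)
for url, outlink in W:
    groups[url].append((url, outlink))

R = []
for url in sorted(groups):
    fw = [t for t in groups[url] if t[1] == "www.apache.org"]  # FILTER
    pw = [(t[1],) for t in fw]                                 # projection FW.outlink
    dw = set(pw)                                               # DISTINCT
    R.append((url, len(dw)))                                   # GENERATE group, COUNT(DW)

print(R)
```

For this sample, a.com links to www.apache.org twice but is counted once after DISTINCT, and b.com gets a count of 0 in this model.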

Notes:

Increasing the parallelism

To increase the parallelism of a job, include the PARALLEL clause in any of your Pig Latin statements. For example:

J = JOIN A by url, B by url PARALLEL 50;

Couple of notes:

Retrieving Results

There are several convenient ways to retrieve the contents in a particular alias:

Note

Debugging Your Scripts

Pig provides several ways to assist in building and validating your script.

All three commands are described in Grunt Manual.

Working with Compressed Files

Compressed Input

Compressed files are difficult to process in parallel, since they cannot, in general, be split into fragments and independently decompressed. However, if the compression is block-oriented (e.g. bz2), splitting and parallel processing are straightforward.

Pig has built-in support for processing .bz2 files in parallel (.gz support is coming soon). If the input file name extension is .bz2, Pig decompresses the file on the fly and passes the decompressed input stream to your load function. For example,

A = LOAD 'input.bz2' USING myLoad();

Multiple instances of myLoad() (as dictated by the degree of parallelism) will be created and each will be given a fragment of the decompressed version of input.bz2 to process.

Compressed Output

Pig currently supports output compression in the .bz2 format (so that the output can subsequently be loaded in parallel). All you have to do is include a .bz2 extension in the name of your output file. Your store function (if any) should simply write uncompressed data, and Pig will compress it on the fly.

For example,

STORE A into 'output.bz2' USING myStore();

Experimenting with Pig Latin syntax

To experiment with the Pig Latin syntax, you can use the StandAloneParser. Invoke it by the following command:

java -cp pig.jar org.apache.pig.StandAloneParser

Example usage:

$ java -cp pig.jar org.apache.pig.StandAloneParser
> A = LOAD 'myfile.txt';
---- Query parsed successfully ---
> B = FOREACH A GENERATE $1, $2;
---- Query parsed successfully ---
> C = COGROUP A BY $0, B BY $0;
---- Query parsed successfully ---
Current aliases: A->null, 
> D = FOREACH C blah blah blah;
Parse error: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered "blah" at line 1, column 15.
Was expecting one of:
    "generate" ...
    "{" ...
> D = FOREACH C GENERATE 'hello world';
---- Query parsed successfully ---
> quit
$ 

Outer Join

[pi] We should add this join rewrite logic in the parser.

Outer join by example:

A = load 'test1';
grunt> dump A;
(k1, vq)
(k1, v2)
(k2, v3)
(k2, v4)
(k3, v5)
(k4, v6)

B = load 'test2';
grunt> dump B;
(k1, w1)
(k2, w2)
(k2, w3)
(k3, w4)
(k8, w8)

CG = COGROUP A by $0, B by $0;
grunt> dump CG;
(k1, {(k1, vq), (k1, v2)}, {(k1, w1)})
(k2, {(k2, v3), (k2, v4)}, {(k2, w2), (k2, w3)})
(k3, {(k3, v5)}, {(k3, w4)})
(k4, {(k4, v6)}, {})
(k8, {}, {(k8, w8)})

A_ONLY_FILTERED =  FILTER CG by (COUNT(B) == '0');
A_ONLY_FLAT = FOREACH A_ONLY_FILTERED GENERATE FLATTEN(A);
dump A_ONLY_FLAT;

(k4, v6)


B_ONLY_FILTERED =  FILTER CG by (COUNT(A) == '0');
B_ONLY_FLAT = FOREACH B_ONLY_FILTERED GENERATE FLATTEN(B);
dump B_ONLY_FLAT;

(k8, w8)

B_AND_A_FILTERED =  FILTER CG by ((COUNT(A) != '0') and (COUNT(B) != '0'));
B_AND_A_FLAT = FOREACH B_AND_A_FILTERED GENERATE FLATTEN(B);
dump B_AND_A_FLAT;

(k1, w1)
(k2, w2)
(k2, w3)
(k3, w4)

Another way is the following:

A = load 'test1';
B = load 'test2';
CG_I_O = COGROUP A by $0 inner, B by $0 outer;
F = FOREACH CG_I_O GENERATE A, ((COUNT(B) == '0')? '' : B) as MODIFIED_B;
G = FOREACH F GENERATE FLATTEN(A), FLATTEN(MODIFIED_B);
-- You can do it in one pass: Y = FOREACH CG_I_O GENERATE FLATTEN(A), FLATTEN(((COUNT(B) == '0')? '' : B));
dump G;

(k1, vq, k1, w1)
(k1, v2, k1, w1)
(k2, v3, k2, w2)
(k2, v4, k2, w2)
(k2, v3, k2, w3)
(k2, v4, k2, w3)
(k3, v5, k3, w4)
(k4, v6, )
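
The second approach can be modeled in Python as a left outer join: keys are taken from A only (the inner side), and an empty bag B is replaced by a single empty atom before flattening. The function left_outer_join is a hypothetical sketch of this rewrite, with keys visited in sorted order for a deterministic result.

```python
A = [("k1", "vq"), ("k1", "v2"), ("k2", "v3"), ("k2", "v4"),
     ("k3", "v5"), ("k4", "v6")]
B = [("k1", "w1"), ("k2", "w2"), ("k2", "w3"), ("k3", "w4"), ("k8", "w8")]

def left_outer_join(a, b):
    """Keep every tuple of a; pad with an empty atom when b has no match."""
    out = []
    for key in sorted({t[0] for t in a}):   # A is the inner side
        bag_a = [t for t in a if t[0] == key]
        # (COUNT(B) == '0') ? '' : B
        bag_b = [t for t in b if t[0] == key] or [("",)]
        for tb in bag_b:
            for ta in bag_a:
                out.append(ta + tb)
    return out

for row in left_outer_join(A, B):
    print(row)
```

The key k8 exists only in B and is therefore absent, while k4 is kept with an empty string standing in for the missing B fields.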

Embedded Pig Latin

Pig Latin can be embedded into a Java program in a manner similar to JDBC. See EmbeddedPig for details.

PigLatin (last edited 2010-09-30 18:37:38 by jsha)