Q: How can I load data using Unicode control characters as delimiters?

The first parameter to PigStorage is the dataset name; the second is a regular expression that describes the delimiter. Fields are extracted from each line with `String.split(regex, -1)`. See java.util.regex.Pattern for more information on how to use special characters in a regex.

If you are loading a file that uses Ctrl+A as the separator, you can specify this to PigStorage using Unicode notation:

LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);
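The effect of the `-1` limit in `String.split(regex, -1)` can be seen in plain Java. The following is a minimal sketch, not Pig's actual loader code; the class name, method name, and sample record are made up for illustration. It shows that trailing empty fields are kept when the limit is negative.

```java
import java.util.Arrays;

public class CtrlADelimiterDemo {
    // Splits a line on the given delimiter regex. The -1 limit keeps
    // trailing empty fields instead of silently dropping them.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Hypothetical record: two values followed by an empty third field,
        // separated by Ctrl+A (U+0001).
        String line = "a\u0001b\u0001";
        String[] fields = splitFields(line, "\u0001");
        System.out.println(fields.length);            // prints 3
        System.out.println(Arrays.toString(fields));
    }
}
```

With the default limit (`line.split("\u0001")`), the same record would yield only two fields, because Java discards trailing empty strings in that case.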

Q: How do I make my jobs run on multiple machines?

Use the PARALLEL clause:

C = JOIN A by url, B by url PARALLEL 50;

Q: How do I make my Pig jobs run on a specified number of reducers?

You can achieve this with the PARALLEL clause. For example:

C = JOIN A by url, B by url PARALLEL 50;

Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. If none of these values is set, Pig uses only 1 reducer. (As of Pig 0.8, the default is no longer 1 but a number calculated by a simple heuristic, as a safeguard.)

Q: Can I do a numerical comparison while filtering?

Yes, you can choose between numerical and string comparison. For numerical comparison use the operators `==`, `!=`, `<`, etc., and for string comparison use `eq`, `neq`, etc.

Q: Does Pig support regular expressions?

Pig supports regular expression matching via the `matches` keyword. It uses java.util.regex `matches` semantics, which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"`, you have to give a pattern of `".*fred"`, not `"fred"`).
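The whole-string rule comes from Java itself, so it can be checked outside Pig. This is a small sketch of the underlying java.util.regex behavior, with illustrative class and method names; it is not Pig's implementation of `matches`.

```java
import java.util.regex.Pattern;

public class FullMatchDemo {
    // True only if the regex matches the entire input (matches() semantics).
    static boolean fullMatch(String input, String regex) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches anywhere in the input (find() semantics),
    // shown only for contrast.
    static boolean partialMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("hi fred", "fred"));    // false
        System.out.println(fullMatch("hi fred", ".*fred"));  // true
        System.out.println(partialMatch("hi fred", "fred")); // true
    }
}
```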

Q: How do I prevent failure if some records don't have the needed number of columns?

You can filter away those records by including the following in your Pig program:

A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....

This code would drop all records that have fewer than five (5) columns.

Q: Is there any difference between `==` and `eq` for numeric comparisons?

There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
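The distinction is the usual one between comparing values as numbers and comparing them as text. As a rough analogy in Java (an assumption-laden sketch with made-up names, not how Pig evaluates `==` or `eq` internally):

```java
public class NumericVsStringDemo {
    // Numeric comparison: both operands are parsed as numbers first.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // String comparison: the operands are compared character by character.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // true
        System.out.println(stringEquals("11.0", "11"));  // false
        System.out.println(stringEquals("11", "11"));    // true
    }
}
```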

Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?

You can run the following set of commands, which are equivalent to `SELECT COUNT` in SQL:

a = LOAD 'mytestfile.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);

Q: Does Pig allow grouping on expressions?

Pig allows grouping on expressions. For example:

grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
grunt> DUMP a;
(1,2,3)
(4,2,1)
(4,3,4)
(4,3,4)
(7,2,5)
(8,4,3)

grunt> b = GROUP a BY (x+y);
grunt> DUMP b;
(3.0,{(1,2,3)})
(6.0,{(4,2,1)})
(7.0,{(4,3,4),(4,3,4)})
(9.0,{(7,2,5)})
(12.0,{(8,4,3)})

If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.

grunt> b = GROUP a BY 4;
grunt> DUMP b;
(4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})

Q: Is there a way to check if a map is empty?

In Pig 2.0 you can test for the existence of a value in a map by checking the lookup against null:

m#'key' is not null

Q: How can I specify the number of nodes Pig allocates?

> pig -Dhod.param='-m 3' my_script.pig

Three (3) nodes is the minimum.