Q: How can I load data using Unicode control characters as delimiters?

In the LOAD statement, the first argument is the dataset name; the argument to PigStorage is a regular expression that describes the delimiter. Pig extracts fields from each line with String.split(regex, -1). See java.util.regex.Pattern for more information on how to use special characters in a regex.

If you are loading a file that uses Ctrl+A as the separator, you can specify this to PigStorage using Unicode notation:

A = LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);
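To make the split semantics concrete, here is a minimal Java sketch of what String.split(regex, -1) does with a Ctrl+A delimiter. The class name, method name, and sample line are hypothetical and chosen only for illustration; this is not Pig's own loader code.

```java
import java.util.Arrays;

class CtrlADelimiterSketch {
    // Splits one line on the delimiter regex. The limit of -1 keeps
    // trailing empty fields instead of silently dropping them.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Hypothetical input: three Ctrl+A separated fields, the last one empty.
        String line = "alice\u000142\u0001";
        String[] fields = splitFields(line, "\u0001");
        System.out.println(fields.length);
        System.out.println(Arrays.toString(fields));
    }
}
```

With a limit of 0 (the default for String.split(regex)), the trailing empty field would be discarded, so a row with an empty last column would appear to have fewer fields.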

Q: How do I make my jobs run on multiple machines?

Use the PARALLEL clause:

C = JOIN A by url, B by url PARALLEL 50;

Q: How do I make my Pig jobs run on a specified number of reducers?

You can achieve this with the PARALLEL clause. For example:

C = JOIN A by url, B by url PARALLEL 50;

Even if you do not specify the PARALLEL clause, the framework uses a default number of reducers, on the order of 0.9*(number of nodes allocated by user - 1)*n, where n is the maximum number of reduce slots, for the M/R jobs that result from statements such as GROUP, COGROUP, JOIN, and ORDER BY. For example, when allocating 3 machines with n = 4, you get about 0.9*2*4 = 7 reducers for your parallel jobs.
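The arithmetic above can be checked with a small Java sketch. The class and method names are hypothetical, and truncating the result to a whole number is an assumption: the text only says "about 7".

```java
class DefaultReducerEstimate {
    // Formula from the text: 0.9 * (nodes - 1) * n.
    // Assumption: the fractional part is simply truncated.
    static int estimate(int allocatedNodes, int maxReduceSlots) {
        return (int) (0.9 * (allocatedNodes - 1) * maxReduceSlots);
    }

    public static void main(String[] args) {
        // 3 allocated machines, n = 4: 0.9 * 2 * 4 = 7.2, which truncates to 7.
        System.out.println(estimate(3, 4));
    }
}
```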

Q: Can I do a numerical comparison while filtering?

Yes, you can choose between numerical and string comparison. For numerical comparison use the operators ==, <>, <, etc., and for string comparison use eq, neq, etc. See the format of Conditions.

Q: Does Pig support regular expressions?

Pig supports regular expression matching via the matches keyword. It uses java.util.regex matching, which means your pattern has to match the entire string (e.g., if your string is "hi fred" and you want to find "fred", you have to give a pattern of ".*fred", not "fred").
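The whole-string behavior is easy to see in plain Java. The sketch below contrasts a full match with a substring search using java.util.regex; the class and method names are hypothetical and only illustrate the semantics described above.

```java
import java.util.regex.Pattern;

class FullMatchSketch {
    // True only if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches anywhere inside the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("fred", "hi fred"));     // false
        System.out.println(fullMatch(".*fred", "hi fred"));   // true
        System.out.println(containsMatch("fred", "hi fred")); // true
    }
}
```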

Q: How do I prevent failure if some records don't have the needed number of columns?

You can filter away those records by including the following in your Pig program:

A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....

This code would drop all records that have fewer than five (5) columns.
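For comparison, the same idea (keep only rows with at least five fields) can be sketched in Java. This is an illustration of the filtering logic under the assumption of tab-separated text lines, not what Pig runs internally; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

class MinColumnsFilterSketch {
    // Keeps only the lines that have at least minColumns tab-separated fields.
    static List<String> keepWithAtLeast(List<String> lines, int minColumns) {
        return lines.stream()
                .filter(line -> line.split("\t", -1).length >= minColumns)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical input: one complete row and one short row.
        List<String> lines = List.of("a\tb\tc\td\te", "a\tb");
        System.out.println(keepWithAtLeast(lines, 5).size());
    }
}
```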

Q: Is there any difference between == and eq for numeric comparisons?

There is no difference when using integers. However, 11.0 and 11 will be equal with == but not with eq.
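A rough Java analogy shows why the two comparisons disagree on 11.0 and 11: a numeric comparison parses both values first, while a string comparison looks at the characters. The class and method names are hypothetical, and this only mirrors the behavior described above rather than Pig's implementation.

```java
class NumericVsStringEquality {
    // Numeric comparison: parse both values, then compare the numbers.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // String comparison: compare the text character by character.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // true
        System.out.println(stringEquals("11.0", "11"));  // false
        System.out.println(stringEquals("11", "11"));    // true
    }
}
```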

Q: Is it possible to use Pig with a regular Hadoop cluster (not HOD)?

Yes. Set the hod.server property to the empty string:

hod.server=""

Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?

You can run the following set of commands, which are equivalent to SELECT COUNT(*) in SQL:

a = LOAD 'mytestfile.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);

Q: Does Pig allow grouping on expressions?

Yes, Pig allows grouping on expressions. For example:

grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
grunt> DUMP a;
(1,2,3)
(4,2,1)
(4,3,4)
(4,3,4)
(7,2,5)
(8,4,3)

grunt> b = GROUP a BY (x+y);
grunt> DUMP b;
(3.0,{(1,2,3)})
(6.0,{(4,2,1)})
(7.0,{(4,3,4),(4,3,4)})
(9.0,{(7,2,5)})
(12.0,{(8,4,3)})

If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.

grunt> b = GROUP a BY 4;
grunt> DUMP b;
(4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
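The grouping on x+y shown above can be mirrored in Java to make the logic explicit. This is a sketch only: the class and method names are hypothetical, rows are modeled as int arrays, and keys are integers rather than the doubles Pig prints.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupByExpressionSketch {
    // Groups rows (x, y, z) by the value of the expression x + y,
    // with keys kept in ascending order.
    static Map<Integer, List<int[]>> groupBySum(List<int[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[0] + r[1], TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        // The same six rows as in the example above.
        List<int[]> rows = Arrays.asList(
                new int[] {1, 2, 3}, new int[] {4, 2, 1}, new int[] {4, 3, 4},
                new int[] {4, 3, 4}, new int[] {7, 2, 5}, new int[] {8, 4, 3});
        groupBySum(rows).forEach((key, group) -> {
            String tuples = group.stream()
                    .map(t -> Arrays.toString(t))
                    .collect(Collectors.joining(","));
            System.out.println(key + " -> " + tuples);
        });
    }
}
```

Replacing the key function with a constant, for example r -> 4, puts every row in a single group, which matches the GROUP a BY 4 result.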

Q: Is there a way to check if a map is empty?

In Pig 2.0 you can test the existence of values in a map using the null construct: m#'key' is not null

Q: How can I specify the number of nodes Pig allocates?

Pass the HOD option through the hod.param property. For example, to request three nodes:

$ pig -Dhod.param='-m 3' my_script.pig

Three (3) nodes is the minimum.

Q: How can I ask Pig to use an already allocated HOD cluster?

Suppose you allocated a cluster:

$ mkdir -p ~/hod-clusters/test
$ hod allocate -d ~/hod-clusters/test -n 5
$ setenv CLUSTERDIR ~/hod-clusters/test

You can then use the following command, using either -Dhod.server='' or -Dhod.server="":

$ pig -cp $CLUSTERDIR -Dhod.server='' myscript.pig 

FAQ (last edited 2009-09-20 23:38:37 by localhost)