Q: How can I load data using Unicode control characters as delimiters?
In a LOAD statement, the first argument is the dataset name; the argument passed to PigStorage is a regular expression describing the delimiter. Fields are extracted from each line with `String.split(regex, -1)`. See java.util.regex.Pattern for more information on how to use special characters in a regex.
If you are loading a file that uses Ctrl+A as the separator, you can specify this to PigStorage using Unicode notation.
A = LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);
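To see what `String.split(regex, -1)` does with a Ctrl+A delimiter, here is a minimal Java sketch. The class name `CtrlASplitDemo` and the helper `splitFields` are hypothetical names chosen for illustration; this is not Pig's own code, only the same standard-library call applied to a sample line.

```java
import java.util.Arrays;

public class CtrlASplitDemo {
    // Hypothetical helper: split one line on a delimiter regex.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Sample line with fields separated by Ctrl+A (U+0001);
        // the last two fields are empty.
        String line = "a\u0001b\u0001\u0001";
        String[] fields = splitFields(line, "\u0001");
        System.out.println(fields.length + " fields: " + Arrays.toString(fields));
    }
}
```

With a limit of -1 the sample line yields four fields, the last two empty; with the default limit the trailing empty fields would be discarded and only two fields would remain.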
Q: How do I make my jobs run on multiple machines?
Use the PARALLEL clause:
C = JOIN A by url, B by url PARALLEL 50;
Q: How do I make my Pig jobs run on a specified number of reducers?
You can achieve this with the PARALLEL clause. For example:
C = JOIN A by url, B by url PARALLEL 50;
Besides the PARALLEL clause, you can use the "set default_parallel" statement in your Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. If none of these is set, Pig uses only 1 reducer. (As of Pig 0.8, the default is no longer 1 but a number calculated by a simple heuristic, as a safeguard.)
Q: Can I do a numerical comparison while filtering?
Yes, you can choose between numerical and string comparison. For numerical comparison use the operators ==, !=, < etc., and for string comparison use eq, neq etc.
Q: Does Pig support regular expressions?
Pig supports regular expression matching via the `matches` keyword. It uses java.util.regex matching, which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"`, you have to give a pattern of `".*fred"`, not `"fred"`).
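The whole-string rule can be checked directly against java.util.regex. The sketch below uses the standard `Pattern.matches` call; the class name `FullMatchDemo` and the helper `fullMatch` are hypothetical names for illustration only.

```java
import java.util.regex.Pattern;

public class FullMatchDemo {
    // Hypothetical helper: true only if the regex matches the ENTIRE input,
    // which is the behavior of java.util.regex "matches".
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "fred" alone does not cover the leading "hi ", so this is false.
        System.out.println(fullMatch("fred", "hi fred"));
        // ".*fred" consumes the leading text as well, so this is true.
        System.out.println(fullMatch(".*fred", "hi fred"));
    }
}
```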
Q: How do I prevent failure if some records don't have the needed number of columns?
You can filter away those records by including the following in your Pig program:
A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....
This code would drop all records that have fewer than five (5) columns.
Q: Is there any difference between `==` and `eq` for numeric comparisons?
There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
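The difference comes down to comparing parsed numbers versus comparing raw text. The Java sketch below is only an analogy under that assumption, not how Pig implements the operators; `CompareDemo`, `numericEquals`, and `stringEquals` are hypothetical names.

```java
public class CompareDemo {
    // Analogy for ==: parse both operands as numbers, then compare values.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // Analogy for eq: compare the text character by character.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Same numeric value, different text.
        System.out.println(numericEquals("11.0", "11"));
        System.out.println(stringEquals("11.0", "11"));
        // Identical integers agree under both comparisons.
        System.out.println(numericEquals("11", "11"));
        System.out.println(stringEquals("11", "11"));
    }
}
```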
Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?
You can run the following set of commands, which are equivalent to `SELECT COUNT` in SQL:
a = LOAD 'mytestfile.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);
Q: Does Pig allow grouping on expressions?
Yes, Pig allows grouping on expressions. For example:
grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
grunt> DUMP a;
(1,2,3)
(4,2,1)
(4,3,4)
(4,3,4)
(7,2,5)
(8,4,3)
grunt> b = GROUP a BY (x+y);
grunt> DUMP b;
(3.0,{(1,2,3)})
(6.0,{(4,2,1)})
(7.0,{(4,3,4),(4,3,4)})
(9.0,{(7,2,5)})
(12.0,{(8,4,3)})
If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.
grunt> b = GROUP a BY 4;
(4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
Q: Is there a way to check if a map is empty?
In Pig 2.0 you can test the existence of values in a map using the null construct:
m#'key' is not null
Q: How can I specify the number of nodes Pig allocates?
> pig -Dhod.param='-m 3' my_script.pig
Three (3) nodes is the minimum.