1. I'm using PigStorage to parse my input files. Can I make it use control characters as delimiters?
Yes. In the LOAD statement the dataset name comes first, and the argument to PigStorage is a regular expression that describes the delimiter. Pig uses String.split(regex, -1) to extract the fields from each line. See java.util.regex.Pattern for how to write special characters in a regex. For example,
LOAD 'input.dat' USING PigStorage('\u0001');
will use ^A as a delimiter.
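To illustrate the splitting behaviour described above, here is a minimal Java sketch of String.split(regex, -1) applied to a ^A-delimited line. The class and method names are hypothetical and this is not Pig's actual source, only the same JDK call used in isolation.

```java
import java.util.Arrays;

public class DelimiterSplitDemo {
    // Splits one line on the delimiter regex. The -1 limit keeps trailing
    // empty fields instead of silently discarding them.
    public static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        String ctrlA = "\u0001"; // the ^A control character (U+0001)
        String line = "alice" + ctrlA + "42" + ctrlA; // last field is empty
        String[] fields = splitFields(line, ctrlA);
        System.out.println(fields.length);            // 3
        System.out.println(Arrays.toString(fields));  // [alice, 42, ]
    }
}
```

With the default limit (no -1) the trailing empty field would be dropped and the record would appear to have only two columns.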
2. Can I do a numerical comparison while filtering?
Yes, you can choose between numerical and string comparison. For numerical comparison use the operators ==, !=, <, etc., and for string comparison use eq, neq, etc. See the format of Conditions.
3. How do I make my jobs run on multiple machines?
Use the PARALLEL clause:
C = JOIN A by url, B by url PARALLEL 50;
4. I would like to use Pig to read a list of .gz files that use '\u0001' as a delimiter. How do I do that?
You can use the following load command:
LOAD 'input_file' USING PigStorage('\u0001');
5. Does Pig support NULLs?
Pig currently has no support for NULL values but it is on the roadmap.
6. Does Pig support regular expressions?
Pig does support regular expression matching via the matches keyword. It uses java.util.regex matches which means your pattern has to match the entire string (e.g. if your string is "hi fred" and you want to find "fred" you have to give a pattern of ".*fred" not "fred").
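The whole-string requirement comes from java.util.regex itself and can be checked in plain Java. The sketch below uses String.matches, which has the same full-match semantics; the class and method names are hypothetical.

```java
public class FullMatchDemo {
    // Returns true only if the pattern matches the ENTIRE value,
    // not just a substring of it.
    public static boolean fullMatch(String value, String pattern) {
        return value.matches(pattern);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("hi fred", "fred"));   // false
        System.out.println(fullMatch("hi fred", ".*fred")); // true
    }
}
```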
7. How do I prevent failure if some records don't have the needed number of columns?
You can filter away those records by including the following in your Pig program:
A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....
This code would drop all records that have fewer than five (5) columns.
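The effect of such a filter can be sketched outside Pig. The Java example below keeps only the tab-delimited lines that have at least a given number of columns; the class name, method name, and sample data are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ArityFilterDemo {
    // Keeps the lines whose tab-separated field count is at least minColumns.
    public static List<String> keepWideRecords(List<String> lines, int minColumns) {
        return lines.stream()
                .filter(line -> line.split("\t", -1).length >= minColumns)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\tb\tc\td\te", "a\tb");
        // Only the first line has five columns, so one record survives.
        System.out.println(keepWideRecords(lines, 5).size()); // 1
    }
}
```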
8. Is there any difference between == and eq for numeric comparisons?
There is no difference when using integers. However, 11.0 and 11 will be equal with == but not with eq.
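The difference can be illustrated by analogy in plain Java: a numeric comparison parses both sides first, while a string comparison looks at the characters. This is a sketch of the idea, not Pig's implementation, and the class and method names are hypothetical.

```java
public class ComparisonDemo {
    // Numeric comparison: both operands are parsed as numbers first.
    public static boolean numericEquals(String left, String right) {
        return Double.parseDouble(left) == Double.parseDouble(right);
    }

    // String comparison: the operands must be character-for-character equal.
    public static boolean stringEquals(String left, String right) {
        return left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // true
        System.out.println(stringEquals("11.0", "11"));  // false
        System.out.println(stringEquals("11", "11"));    // true
    }
}
```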
9. Is it possible to use Pig with a regular Hadoop cluster (not HOD)?
You can set this property using the empty string.
10. Is there an easy way for me to figure out how many rows exist in a dataset from its alias?
You can run the following set of commands:
a = LOAD 'bla' ... ;
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);
This is equivalent to SELECT COUNT(*) in SQL.
11. Does Pig allow grouping on expressions?
Currently, Pig only allows grouping on data fields rather than expressions. Allowing grouping on expressions is on our roadmap. Stay tuned!
12. Is there a way to check if a map is empty?
Currently, there is no way to do that.
13. How can I specify the number of nodes Pig allocates?
Pass the node count with the hod.param property on the command line:
> pig -Dhod.param='-m 3' my_script.pig
Three (3) nodes is the minimum.
14. How can I load data using PigStorage() that requires Unicode specification for separators?
Older versions of Pig, using '\t':
a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');
Newer versions of Pig, using a Unicode escape:
a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0009');