Q: How can I load data using Unicode control characters as delimiters?
The first parameter to PigStorage is the dataset name, the second is a regular expression to describe the delimiter. Fields are extracted from each line with `String.split(regex, -1)`. See java.util.regex.Pattern for more information on how to use special characters in a regex.
If you are loading a file that uses Ctrl+A as the separator, you can specify this to PigStorage using the Unicode notation:
LOAD 'input.dat' USING PigStorage('\u0001') AS (x,y,z);
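To see why the `-1` limit matters, here is a small Java sketch of the `String.split(regex, -1)` call mentioned above, applied to a made-up Ctrl+A-delimited line. The class name, method name, and sample record are illustrative assumptions, not Pig code.

```java
import java.util.Arrays;

public class CtrlADelimiterDemo {
    // Splits a line on the given regex; the -1 limit keeps trailing empty fields.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Hypothetical record: three fields separated by Ctrl+A, the last one empty.
        String line = "alice\u000142\u0001";

        // With limit -1 the empty third field is preserved: 3 fields.
        System.out.println(splitFields(line, "\u0001").length);

        // With the default limit the trailing empty field is dropped: 2 fields.
        System.out.println(line.split("\u0001").length);

        System.out.println(Arrays.toString(splitFields(line, "\u0001")));
    }
}
```

Keeping trailing empty fields means every record yields the same number of columns even when the last values are missing.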
Q: How do I make my jobs run on multiple machines?
Use the PARALLEL clause:
C = JOIN A by url, B by url PARALLEL 50;
Q: How do I make my Pig jobs run on a specified number of reducers?
You can achieve this with the PARALLEL clause. For example:
C = JOIN A by url, B by url PARALLEL 50;
Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism. If none of these values is set, Pig uses only 1 reducer. (As of Pig 0.8, the default is no longer 1 but a number calculated by a simple heuristic, as a safeguard.)
Q: Can I do a numerical comparison while filtering?
Yes, you can choose between numerical and string comparison. For numerical comparison use the operators ==, !=, < etc., and for string comparison use eq, neq etc.
Q: Does Pig support regular expressions?
Pig does support regular expression matching via the `matches` keyword. It uses java.util.regex matches which means your pattern has to match the entire string (e.g. if your string is `"hi fred"` and you want to find `"fred"` you have to give a pattern of `".*fred"` not `"fred"`).
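The whole-string behaviour described above can be checked directly against java.util.regex. This is a minimal Java sketch rather than Pig code; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class FullMatchDemo {
    // True only if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches anywhere inside the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "hi fred";
        System.out.println(fullMatch("fred", s));     // false: "hi " is left unmatched
        System.out.println(fullMatch(".*fred", s));   // true: pattern covers the whole string
        System.out.println(containsMatch("fred", s)); // true: substring search
    }
}
```

If the text you are looking for can also be followed by other characters, the pattern needs a trailing `.*` as well, for example `".*fred.*"`.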
Q: How do I prevent failure if some records don't have the needed number of columns?
You can filter away those records by including the following in your Pig program:
A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....
This code drops all records that have fewer than five (5) columns, because FILTER keeps only the records for which the condition is true.
Q: Is there any difference between `==` and `eq` for numeric comparisons?
There is no difference when using integers. However, `11.0` and `11` will be equal with `==` but not with `eq`.
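The distinction is the usual one between comparing values as numbers and comparing them as text. The following Java sketch is only an analogy (it does not show how Pig implements `==` or `eq`), and its class and method names are hypothetical.

```java
public class NumericVsStringDemo {
    // Numeric comparison: both operands are parsed as numbers first.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // String comparison: the operands are compared character by character.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // true: same numeric value
        System.out.println(stringEquals("11.0", "11"));  // false: different text
        System.out.println(stringEquals("11", "11"));    // true: identical text
    }
}
```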
Q: Is there an easy way for me to figure out how many rows exist in a dataset from its alias?
You can run the following set of commands, which are equivalent to `SELECT COUNT` in SQL:
a = LOAD 'mytestfile.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);
Q: Does Pig allow grouping on expressions?
Yes, Pig allows grouping on expressions. For example:
grunt> a = LOAD 'mytestfile.txt' AS (x,y,z);
grunt> DUMP a;
(1,2,3)
(4,2,1)
(4,3,4)
(4,3,4)
(7,2,5)
(8,4,3)
grunt> b = GROUP a BY (x+y);
grunt> DUMP b;
(3.0,{(1,2,3)})
(6.0,{(4,2,1)})
(7.0,{(4,3,4),(4,3,4)})
(9.0,{(7,2,5)})
(12.0,{(8,4,3)})
If the grouping is based on constants, the result is the same as GROUP ALL except the group-id is replaced by the constant.
grunt> b = GROUP a BY 4;
grunt> DUMP b;
(4,{(1,2,3),(4,2,1),(4,3,4),(4,3,4),(7,2,5),(8,4,3)})
Q: Is there a way to check if a map is empty?
In Pig 2.0 you can test the existence of values in a map using the null construct:
m#'key' is not null
Q: How can I specify the number of nodes Pig allocates?
> pig -Dhod.param='-m 3' my_script.pig
Three (3) nodes is the minimum.
Q: I load data from a directory which contains different files. How do I find out where the data comes from?
You can write a LoadFunc which appends the file name to each tuple you load.
For example:
A = load '*.txt' using PigStorageWithInputPath();
Here is the LoadFunc:
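import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        // Append the source file path as an extra trailing field.
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}
In Pig 0.8.0 and beyond, you need to set "pig.splitCombination" to false for PigStorageWithInputPath to work correctly.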