Differences between revisions 7 and 8
Revision 7 as of 2008-09-16 21:52:31
Size: 3496
Comment: cleanup formatting and grammar
Revision 8 as of 2009-09-20 23:38:06
Size: 3502
Editor: localhost
Comment: converted to 1.6 markup

1. I'm using PigStorage to parse my input files. Can I make it use control characters as delimiters?

Yes. In the LOAD statement, the first argument is the dataset name and the argument to PigStorage is a regular expression that describes the delimiter. Pig uses String.split(regex, -1) to extract fields from each line. See java.util.regex.Pattern for more information on how to write special characters in a regex. For example,

LOAD 'input.dat' USING PigStorage('\u0001');

will use ^A as a delimiter.
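
To see what String.split(regex, -1) does with a control-character delimiter, here is a minimal Java sketch. The class and helper names (SplitDemo, splitFields) are made up for illustration and are not part of Pig; the sketch only exercises the JDK call named above.

```java
import java.util.Arrays;

final class SplitDemo {
    // Hypothetical helper: split one line on a delimiter regex.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Four fields separated by Ctrl-A (U+0001); the last two are empty.
        String line = "alice\u000142\u0001\u0001";
        String[] fields = splitFields(line, "\u0001");
        System.out.println(fields.length + " fields: " + Arrays.toString(fields));
    }
}
```

With a limit of -1 the two trailing empty fields are kept, so the line yields four fields. A plain split(regex) would drop them and return only two.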

2. Can I do a numerical comparison while filtering?

Yes, you can choose between numerical and string comparison. For numerical comparison use the operators ==, !=, < etc., and for string comparison use eq, neq etc. See the format of Conditions.

3. How do I make my jobs run on multiple machines?

Use the PARALLEL clause:

C = JOIN A by url, B by url PARALLEL 50;

4. I would like to use Pig to read a list of .gz files that use '\u0001' as a delimiter. How do I do that?

You can use the following load command:

LOAD 'input_file' USING PigStorage('\u0001');

5. Does Pig support NULLs?

Pig currently has no support for NULL values but it is on the roadmap.

6. Does Pig support regular expressions?

Pig does support regular expression matching via the matches keyword. It uses java.util.regex matches which means your pattern has to match the entire string (e.g. if your string is "hi fred" and you want to find "fred" you have to give a pattern of ".*fred" not "fred").
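
The whole-string behaviour comes from java.util.regex itself and can be checked in plain Java. The sketch below is illustrative only: the class and method names (MatchesDemo, matchesWhole, containsMatch) are assumptions, and it shows the JDK semantics rather than Pig's own code.

```java
import java.util.regex.Pattern;

final class MatchesDemo {
    // True only if the regex matches the entire input, as described above.
    static boolean matchesWhole(String input, String regex) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches anywhere in the input (substring search).
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("hi fred", "fred"));    // whole-string match fails
        System.out.println(matchesWhole("hi fred", ".*fred"));  // whole-string match succeeds
        System.out.println(containsMatch("hi fred", "fred"));   // substring search succeeds
    }
}
```

If you are used to substring-style searching, wrapping the pattern as ".*fred.*" gives the closest equivalent under whole-string matching.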

7. How do I prevent failure if some records don't have the needed number of columns?

You can filter away those records by including the following in your Pig program:

A = LOAD 'foo' USING PigStorage('\t');
B = FILTER A BY ARITY(*) >= 5;
.....

This code would drop all records that have fewer than five (5) columns.
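
As a rough illustration of the filtering described above, the Java sketch below keeps only lines that have at least a minimum number of tab-separated fields. It is an analogy in plain Java, not Pig's implementation; the names ArityFilterDemo and keepWideRecords are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ArityFilterDemo {
    // Keep only the lines whose tab-separated field count is at least minColumns.
    static List<String> keepWideRecords(List<String> lines, int minColumns) {
        return lines.stream()
                .filter(line -> line.split("\t", -1).length >= minColumns)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "a\tb\tc\td\te",   // five fields: kept
                "a\tb");           // two fields: dropped
        System.out.println(keepWideRecords(lines, 5).size());
    }
}
```

The filter condition names the records to keep, so dropping short records means testing for "at least five" rather than "fewer than five".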

8. Is there any difference between == and eq for numeric comparisons?

There is no difference when using integers. However, 11.0 and 11 will be equal with == but not with eq.
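
The same distinction exists in plain Java, which may make it easier to picture. This is only an analogy under the assumption that numeric comparison parses the values as numbers and string comparison compares the text as written; the names ComparisonDemo, numericEquals and stringEquals are made up for the example.

```java
final class ComparisonDemo {
    // Numeric comparison: parse both values as numbers, then compare.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // String comparison: compare the text character by character.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // same number
        System.out.println(stringEquals("11.0", "11"));  // different text
        System.out.println(stringEquals("11", "11"));    // identical integers agree either way
    }
}
```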

9. Is it possible to use Pig with a regular Hadoop cluster (not HOD)?

Yes. Set the hod.server property to the empty string:

hod.server=""

10. Is there an easy way for me to figure out how many rows exist in a dataset from its alias?

You can run the following set of commands:

a = LOAD 'bla' ... ;
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a.$0);

This is equivalent to SELECT COUNT(*) in SQL.

11. Does Pig allow grouping on expressions?

Currently, Pig only allows grouping on data fields rather than expressions. Allowing grouping on expressions is on our roadmap. Stay tuned!

12. Is there a way to check if a map is empty?

Currently, there is no way to do that.

13. How can I specify the number of nodes Pig allocates?

Pass the node count to HOD through the hod.param property on the command line:

> pig -Dhod.param='-m 3' my_script.pig

Three (3) nodes is the minimum.

14. How can I load data using PigStorage() that requires Unicode specification for separators?

Older versions of Pig, using '\t':

a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\t');

Newer versions of Pig, using a Unicode escape:

a = LOAD '/homes/yahooid/tmp/a.txt' USING PigStorage('\u0000B');

PigFaq (last edited 2009-09-20 23:38:06 by localhost)