Revision 1 as of 2007-11-07 19:50:34
Pig Latin Schemas
Defining a schema in a LOAD statement
The basic grammar for schema definition is borrowed from the JSON/Python notation for tuples, lists, and maps, and is as follows:
field1 = Atom alias
name : (f1, f2, ...) = Tuple alias and schema
name : [f1, f2, ...] = Bag alias and schema
So the schema:
(time, query : (display, normalized), results : [url, title, summary])
would define a Tuple where the first field is an Atom called "time", the second field is a Tuple called "query" with the Atom fields "display" and "normalized", and the third field is a Bag called "results", which contains tuples that have three Atom fields "url", "title" and "summary".
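To make the grammar concrete, here is a minimal Python sketch of a parser for it. This is an illustration only, not Pig's own parser: the function names (tokenize, parse_field, parse_group, parse_schema) and the (kind, name, children) triples are assumptions made for this example.

```python
import re

# A token is either a field name or one of the punctuation marks in the grammar.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[()\[\],:])")
PUNCTUATION = set("()[],:")
# Opening bracket -> (kind of nested value, matching closing bracket).
CLOSERS = {"(": ("tuple", ")"), "[": ("bag", "]")}


def tokenize(text):
    """Split a schema string into names and punctuation tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_field(tokens, pos):
    """Parse one field; return ((kind, name, children), next_position)."""
    if pos >= len(tokens) or tokens[pos] in PUNCTUATION:
        raise ValueError("expected a field name")
    name = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] == ":":
        # "name : (...)" is a Tuple, "name : [...]" is a Bag.
        children, kind, pos = parse_group(tokens, pos + 1)
        return (kind, name, children), pos
    return ("atom", name, []), pos


def parse_group(tokens, pos):
    """Parse a bracketed field list; return (fields, kind, next_position)."""
    if pos >= len(tokens) or tokens[pos] not in CLOSERS:
        raise ValueError("expected '(' or '['")
    kind, closer = CLOSERS[tokens[pos]]
    pos += 1
    fields = []
    while True:
        field, pos = parse_field(tokens, pos)
        fields.append(field)
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
        elif pos < len(tokens) and tokens[pos] == closer:
            return fields, kind, pos + 1
        else:
            raise ValueError(f"expected ',' or '{closer}'")


def parse_schema(text):
    """Parse a whole schema, which must be one parenthesized field list."""
    tokens = tokenize(text)
    fields, kind, pos = parse_group(tokens, 0)
    if kind != "tuple" or pos != len(tokens):
        raise ValueError("a schema must be a single parenthesized field list")
    return fields


schema = parse_schema(
    "(time, query : (display, normalized), results : [url, title, summary])"
)
```

Here schema holds three entries: an atom named time, a tuple named query with two atom children, and a bag named results with three atom children.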
The "AS" keyword on a LOAD statement allows you to define a schema for a particular alias. For example,
A = load 'input1' as (tstamp, cookie, query);
B = load 'input2' as (query, url, rank);
associates schemas with A and B.
The system will do its best to infer the schema for a derived alias based on the schemas of the input aliases.
Continuing with our running example, suppose we have
C = cogroup A by query, B by query;
Then C will be assigned the schema (group, A: [tstamp, cookie, query], B: [query, url, rank]).
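The inference rule for cogroup in this example can be modeled in a few lines of Python. This is only a sketch of the rule as described above, not Pig's implementation; cogroup_schema and render are hypothetical helper names.

```python
def cogroup_schema(inputs):
    """Model the schema of a cogroup: a 'group' atom, then one bag per input.

    inputs maps each input alias to its ordered list of field names.
    """
    schema = [("atom", "group", [])]
    for alias, fields in inputs.items():
        # Each input alias becomes a bag that keeps the input's own fields.
        schema.append(("bag", alias, list(fields)))
    return schema


def render(schema):
    """Format a schema in the notation used on this page."""
    parts = []
    for kind, name, fields in schema:
        if kind == "bag":
            parts.append(f"{name}: [{', '.join(fields)}]")
        else:
            parts.append(name)
    return "(" + ", ".join(parts) + ")"


inputs = {
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
}
print(render(cogroup_schema(inputs)))
# prints: (group, A: [tstamp, cookie, query], B: [query, url, rank])
```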
Referring to Nested Fields, i.e., Nested Projection
You can refer to fields at most one level down in the nesting. Thus, in the above example, you can write:
foreach C generate group, A.cookie
Name Ambiguity Resolution
Sometimes, when using FLATTEN, there might be name ambiguities in schemas from two different inputs. Thus, if in the above example, we write
D = foreach C generate flatten(A), flatten(B);
There will be a name ambiguity, since both flatten(A) and flatten(B) produce a field called query. To avoid ambiguity in such cases, fields can be referred to as <outer-alias>::fieldName. Thus for D, we can refer to either A::query or B::query, but not to plain query.
The unambiguous fields, however, can be accessed either by their plain names or as <outer-alias>::fieldName. Thus for D, both url and B::url refer to the same field.
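The resolution rule can be sketched in Python as follows. This models only the behavior described above, under the assumption that every flattened field keeps its outer alias as a prefix; flatten_schema and resolve are hypothetical names, not part of Pig.

```python
def flatten_schema(inputs):
    """Return the qualified field names produced by flattening each input.

    inputs maps each outer alias to its ordered list of field names.
    """
    return [
        f"{alias}::{field}"
        for alias, fields in inputs.items()
        for field in fields
    ]


def resolve(schema, name):
    """Return the position of a field given a plain or qualified name."""
    if "::" in name:
        matches = [i for i, q in enumerate(schema) if q == name]
    else:
        # A plain name matches any qualified name with the same field part.
        matches = [
            i for i, q in enumerate(schema) if q.split("::", 1)[1] == name
        ]
    if not matches:
        raise KeyError(f"no such field: {name}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous field name: {name}")
    return matches[0]


d_schema = flatten_schema({
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
})
```

With this model, resolve(d_schema, "url") and resolve(d_schema, "B::url") return the same position, while resolve(d_schema, "query") raises an error because two fields share that name.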
Assigning Names to Individual Items in GENERATE
Just like in SQL where you can give names to individual items in the select list, we can name individual items in the generate clause using AS. Thus, in our example,
E = foreach D generate (cookie eq 'null' ? 'null' : url ) as nullifiedUrl, rank as myRank;
This will assign a schema (nullifiedUrl, myRank) to E.
Schemas of Functions
Eval functions can specify their own output schema by overriding the outputSchema() method. The builtin function SUM specifies that its output is called sum. Thus,
F = foreach C generate group, SUM(tstamp);
F gets assigned the schema (group, sum). This can of course be overridden, e.g., generate group, SUM(tstamp) as sumTstamp.
Last Resort: Overriding system-inferred schemas
Sometimes the system cannot infer a schema (e.g., for binconds, or for eval functions that don't specify one). In these cases, and in any other case where you want to replace the system-inferred schema, you can override it using the AS clause. Thus, you could say:
C = (cogroup A by query, B by query) as (group, foo, bar);
and C would be assigned the schema (group, foo, bar).