Differences between revisions 1 and 2
Revision 1 as of 2007-11-07 19:50:34
Size: 3736
Editor: rollwhite-lx
Revision 2 as of 2009-09-20 23:38:18
Size: 3736
Editor: localhost
Comment: converted to 1.6 markup

Pig Latin Schemas

Defining a schema in a LOAD statement

The basic grammar for schema definition is taken from the JSON/Python tuple/list/map notation, and is as follows:

field1 = Atom alias
name : (f1, f2, ...) = Tuple alias and schema
name : [f1, f2, ...] = Bag alias and schema

So the schema:

(time, query : (display, normalized), results : [url, title, summary])

would define a Tuple where the first field is an Atom called "time", the second field is a Tuple called "query" with the Atom fields "display" and "normalized", and the third field is a Bag called "results", which contains tuples that have three Atom fields "url", "title" and "summary".
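
To make the notation concrete, here is a minimal Python sketch of a parser for it. This is not Pig's own parser. The function names and the ("atom" | "tuple" | "bag", name, fields) representation are assumptions made for illustration only.

```python
import re

# One token is either a field name or one of the punctuation marks ( ) [ ] , :
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[()\[\],:])")
CLOSERS = {"(": ")", "[": "]"}
KINDS = {"(": "tuple", "[": "bag"}  # hypothetical labels, not Pig class names


def tokenize(text):
    """Split a schema string into field names and punctuation."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_field(tokens, pos):
    """Parse `name` (an Atom) or `name : (...)` / `name : [...]`."""
    if pos >= len(tokens) or not tokens[pos].isidentifier():
        raise ValueError("expected a field name")
    name, pos = tokens[pos], pos + 1
    if pos < len(tokens) and tokens[pos] == ":":
        opener = tokens[pos + 1] if pos + 1 < len(tokens) else None
        if opener not in CLOSERS:
            raise ValueError(f"expected '(' or '[' after '{name} :'")
        fields, pos = parse_fields(tokens, pos + 2, CLOSERS[opener])
        return (KINDS[opener], name, fields), pos
    return ("atom", name), pos


def parse_fields(tokens, pos, closer):
    """Parse comma-separated fields up to `closer`; return (fields, next position)."""
    fields = []
    while True:
        field, pos = parse_field(tokens, pos)
        fields.append(field)
        if pos >= len(tokens):
            raise ValueError("schema is missing a closing bracket")
        if tokens[pos] == closer:
            return fields, pos + 1
        if tokens[pos] != ",":
            raise ValueError(f"expected ',' or {closer!r}, got {tokens[pos]!r}")
        pos += 1


def parse_schema(text):
    """Parse a whole schema, which is itself a parenthesized Tuple."""
    tokens = tokenize(text)
    if not tokens or tokens[0] != "(":
        raise ValueError("a schema must start with '('")
    fields, pos = parse_fields(tokens, 1, ")")
    if pos != len(tokens):
        raise ValueError("unexpected input after the closing ')'")
    return fields


schema = parse_schema(
    "(time, query : (display, normalized), results : [url, title, summary])"
)
```

In this sketch, schema[0] is the Atom "time", schema[1] is the Tuple "query" with two Atom fields, and schema[2] is the Bag "results" whose tuples have three Atom fields.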

The "AS" keyword on a LOAD statement allows you to define a schema for a particular alias. For example,

A = load 'input1' as (tstamp, cookie, query);
B = load 'input2' as (query, url, rank);

associates schemas with A and B.

Schema Propagation

The system will do its best to infer the schema for a derived alias based on the schemas of the input aliases.

Continuing with our running example, suppose we have

C = cogroup A by query, B by query;

Then C will be assigned the schema (group, A: [tstamp, cookie, query], B: [query, url, rank]).
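
The propagation rule for cogroup can be modeled in a few lines of Python. This is only a sketch of the rule as described here, not the system's inference code. The names cogroup_schema and render are hypothetical.

```python
def cogroup_schema(inputs):
    """Model the schema produced by cogrouping the given inputs.

    `inputs` maps each input alias to its list of field names, in order.
    The result starts with the Atom `group`, followed by one Bag per input
    that keeps the input's alias and field names.
    """
    schema = [("atom", "group")]
    for alias, fields in inputs.items():
        schema.append(("bag", alias, list(fields)))
    return schema


def render(schema):
    """Format a modeled schema in the notation used on this page."""
    parts = []
    for entry in schema:
        if entry[0] == "atom":
            parts.append(entry[1])
        else:
            _, alias, fields = entry
            parts.append(f"{alias}: [{', '.join(fields)}]")
    return "(" + ", ".join(parts) + ")"


A = ["tstamp", "cookie", "query"]   # schema given to A in the LOAD statement
B = ["query", "url", "rank"]        # schema given to B in the LOAD statement
C = cogroup_schema({"A": A, "B": B})
print(render(C))  # (group, A: [tstamp, cookie, query], B: [query, url, rank])
```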


Referring to Nested Fields, i.e., Nested Projection

You can refer to fields at most one level down in the nesting. Thus, in the above example, you can say:

foreach C generate group, A.cookie
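
The one-level limit can be illustrated with a small Python check. It is a sketch under the assumption that a reference is either a top-level name or outer.inner. The dictionary layout and the name resolve_projection are hypothetical.

```python
# Schema of C from the running example: None marks an Atom,
# a list gives the field names of the tuples inside a Bag.
C_SCHEMA = {
    "group": None,
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
}


def resolve_projection(schema, ref):
    """Resolve `name` or `outer.inner`; reject anything nested more deeply."""
    parts = ref.split(".")
    if len(parts) > 2:
        raise ValueError(f"{ref!r} reaches more than one level down")
    top = parts[0]
    if top not in schema:
        raise KeyError(f"no field named {top!r}")
    if len(parts) == 1:
        return (top,)
    nested = schema[top]
    if nested is None:
        raise ValueError(f"{top!r} is an Atom and has no nested fields")
    if parts[1] not in nested:
        raise KeyError(f"{top!r} has no nested field {parts[1]!r}")
    return (top, parts[1])
```

Under this model, resolve_projection(C_SCHEMA, "A.cookie") succeeds, while a reference with two dots is rejected.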

Name Ambiguity Resolution

Sometimes, when using FLATTEN, the schemas of two different inputs contain the same field name. For example, continuing from the above, suppose we write

D = foreach C generate flatten(A), flatten(B);

There will be a name ambiguity, since both flatten(A) and flatten(B) produce a field named query. To avoid ambiguity in such cases, fields can be referred to as <outer-alias>::fieldName. Thus for D, we can refer to either A::query or B::query, but not to query.

However, unambiguous fields can be accessed both by their plain names and by <outer-alias>::fieldName. Thus for D, both url and B::url will access the same field.
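
The resolution rule can be sketched in Python as follows. This models the behavior described above and is not the system's implementation. The names flatten_schema and resolve are hypothetical.

```python
def flatten_schema(bags):
    """Model flattening several bags: each field keeps its outer alias."""
    return [(alias, field) for alias, fields in bags for field in fields]


def resolve(fields, ref):
    """Resolve `name` or `alias::name` against a flattened schema."""
    if "::" in ref:
        alias, name = ref.split("::", 1)
        matches = [f for f in fields if f == (alias, name)]
    else:
        matches = [f for f in fields if f[1] == ref]
    if not matches:
        raise KeyError(f"no field matches {ref!r}")
    if len(matches) > 1:
        options = ", ".join("::".join(m) for m in matches)
        raise ValueError(f"{ref!r} is ambiguous; use one of: {options}")
    return "::".join(matches[0])


D = flatten_schema([
    ("A", ["tstamp", "cookie", "query"]),
    ("B", ["query", "url", "rank"]),
])
```

With this model, resolve(D, "url") and resolve(D, "B::url") both return "B::url", and resolve(D, "query") raises an error that lists A::query and B::query as the valid alternatives.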

Assigning Names to Individual Items in GENERATE

Just like in SQL where you can give names to individual items in the select list, we can name individual items in the generate clause using AS. Thus, in our example,

E = foreach D generate (cookie eq 'null' ? 'null' : url ) as nullifiedUrl, rank as myRank;

This will assign a schema (nullifiedUrl, myRank) to E.

Schemas of Functions

Eval functions can specify their own output schema by overriding the outputSchema() method. The built-in function SUM specifies that its output is called sum. Thus,

F = foreach C generate group, SUM(A.tstamp);

F gets assigned the schema (group, sum). This can of course be overridden, e.g., generate group, SUM(A.tstamp) as sumTstamp.

Last Resort: Overriding system-inferred schemas

Sometimes the system cannot infer a schema (e.g., for binconds, or for eval functions that don't specify one). In these cases, and in any other case where you want to replace the system-inferred schema, you can override it using the AS clause. Thus, you could say:

C = (cogroup A by query, B by query) as (group, foo, bar);

and C would be assigned the schema (group, foo, bar).

PigLatinSchemas (last edited 2009-09-20 23:38:18 by localhost)