Pig Types Functional Specification

Purpose

This document proposes a functional specification for the Pig type system. It reviews what has already been implemented and discusses changes that would be made to the type system to complete it. In addition to types, expression operators will be discussed.

Definitions

Supported Types

Currently (pig release 1.2) pig has four data types: atom, tuple, bag, and map.

Pig will now have the following data types: bag, tuple, map, int, long, float, double, chararray, and bytearray.

Eventually, user defined types (UDTs) may be added. They are not specified at this time. However, nothing should be done in the implementation of types or operators that will preclude the addition of UDTs at a later time.

Type Specification

There is no current way to specify types. The system simply assumes types based on how the user makes use of the field (e.g. if # is applied to the field, it must be a map).

With these changes, users will be able to specify data types for incoming data. Currently, it is possible to define column names via AS. This will be extended to allow defining a type for the column. The syntax will be:

LOAD 'myfile' AS (colname [: type], ...)

This will be legal anywhere AS is legal. Note that in many instances, using this syntax will result in a cast, since the results of a FOREACH ... GENERATE statement will in most cases already have a type.

Users will not be required to define the type of a given datum. If the type is not provided the datum will be assigned the type bytearray. No conversions will be done on bytearrays unless users coerce them into other types (see expression operators below for details).
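
A minimal sketch of a load with partially specified types (the file and column names are hypothetical); the untyped column info defaults to bytearray:

a = LOAD 'mydata' AS (name: chararray, age: int, gpa: float, info);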

When defining complex types, users have two choices. They can define them completely. For example, column a is a bag of tuples, each tuple having size 3, with types (long, int, long). They can also define a complex type to contain an unknown type. Thus column a could be described as a bag of unknown, or a bag of tuples of unknown.

The complete syntax for describing types will be:

| Type declaration | Comments |
| bag{<identifier> [: tuple](<identifier> [: type] [, <identifier> [: type]])} | |
| bag{ } | indicates the contents of the bag are not known |
| tuple( ) | indicates the contents of the tuple are not known |
| tuple(<identifier> [: type] [, <identifier> [: type]]) | |
| map[ ] | the key can be of any basic type; value types default to bytearray but can be cast to any type. Here [ and ] are type specifiers, not the regular expression notation for one or more instances |
| int | |
| long | |
| float | |
| double | |
| chararray | |
| bytearray | |

For example (column and field names are illustrative):

a = LOAD 'myfile' AS (garage: bag{ }, links: bag{l: tuple(link: chararray)}, page: bag{t: tuple( )}, coordinates: bag{c: tuple(x: float, y: bytearray)});

Users must also be able to declare types for their user defined functions (UDFs). DEFINE will be changed to have the following syntax:

DEFINE alias '=' type funcspec '(' arglist ')' ';'

alias: [a-zA-Z_][a-zA-Z_0-9]+

type: bag | tuple | map | int | long | float | double | chararray | bytearray

funcspec: pathelement
          funcspec '.' pathelement

pathelement: [a-zA-Z_][a-zA-Z_0-9]+

arglist: type [varname]
         arglist ',' type [varname]

varname: [a-zA-Z_][a-zA-Z_0-9]+

As with other types, users will not be required to declare the signature of their UDFs, in which case any arguments can be passed to the UDF and the UDF's return value will be treated as type bytearray.

If the function signature is defined, but the function is used in a way that does not match that signature, this will result in a compile time error. For example, the following will result in an error:

DEFINE myfunc = com.mycompany.myudfs.myfunc(int i, chararray c);
...
processed = foreach raw generate myfunc($0, $1, $2);

Also, if a function signature is defined, the parser will do type checking on the function's arguments to ensure they match the definition given for the function.

Function overloading, default arguments, and indeterminate number of arguments (i.e. myfunc(int, ...)) will not be supported at this time. Since users can alias a given function as many times as they want using DEFINE, if they wish to use the same function with different arguments they can alias it multiple times to different names.
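
For contrast, a call that matches the declared signature would pass the type check; a minimal sketch reusing the hypothetical UDF aliased in the example above:

DEFINE myfunc = com.mycompany.myudfs.myfunc(int i, chararray c);
...
processed = foreach raw generate myfunc($0, $1);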

Constants

Currently, the only supported constants are strings, which must be enclosed in single quotes.

String constants will be assumed to be of type chararray. Users can specify these constants as ASCII characters (e.g. x MATCHES 'abc' ) or as octal constants (e.g. x MATCHES '\0141\0142\0143' ). This allows users to use non-ASCII characters in expressions. Casts to type bytearray will be supported (see casts below).

Support will be added for numeric constants. By default any numeric constant matching the regular expression [0-9]+ will be assigned a type of int. Any numeric constant matching the regular expression [0-9]+[lL] will be assigned a type of long. Any numeric constant matching the regular expression [0-9]*\.[0-9]+([eE][+-]?[0-9]+)? will be assigned a type of double. Any numeric constant matching the regular expression [0-9]*\.[0-9]+([eE][+-]?[0-9]+)?[fF] will be assigned a type of float.
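
For illustration, under these rules the constants in the following (hypothetical) expression would be typed int, long, float, and double respectively:

b = FOREACH a GENERATE $0 + 1, $1 + 1L, $2 * 0.5f, $3 / 2.0;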

Support will also be added for complex constants. The syntax for complex constants will be:

datalist: datum
          datalist ',' datum

key_value_pair_list: key_value_pair
                     key_value_pair_list ',' key_value_pair

key_value_pair: atomic_datum '#' datum

bag: '{' datalist '}'

tuple: '(' datalist ')'

map: '[' key_value_pair_list ']'

The use of ( ) for tuple, { } for bag, and [ ] for map was chosen to match the current output style of the dump command.
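
A sketch of how such constants might appear in an expression, with purely illustrative values (a tuple, a bag, and a map constant):

b = FOREACH a GENERATE ('hello', 'world'), {(1, 2), (3, 4)}, ['lang'#'pig'];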

Nulls

The concept of NULL is not currently supported in pig. This forces errors in some cases where we would prefer not to error out (such as divide by 0 and failed UDF calls).

All existing operators will be changed to work with Nulls. Those changes will conform to SQL NULL semantics. The following table describes how each operator will handle a null value:

| Operator | Interaction |
| Equality operators | If either sub-expression is null, the result of the equality comparison will be null |
| Matches | If either the string being matched against or the string defining the match is null, the result will be null |
| Is null | Returns true if the tested value is null |
| Arithmetic operators, concat | If either sub-expression is null, the resulting expression is null |
| Size | If the tested object is null, size will return null |
| Dereference of a map or tuple | If the dereferenced map or tuple is null, the result will be null |
| Cast | Casting a null from one type to another will result in a null |
| Aggregate functions | Aggregate functions will ignore nulls, just as they do in SQL; this requires changes to the existing implementations |
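
As an illustrative consequence of these semantics (relation and field names hypothetical), the following filter would drop records whose name field is null, since null == 'alice' evaluates to null rather than true:

b = FILTER a BY name == 'alice';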

Generating Nulls

In addition to the table above, which indicates when operators will generate nulls, the following actions will also generate a null rather than an error: division by zero, and a UDF call that fails (the cases noted earlier where pig currently forces an error).

Expression Operators

Expression operators do not currently define what operands they support, though the use of some operands causes run time errors. For example, a query such as

a = load '/user/pig/tests/data/singlefile/studenttab10k';
b = group a by $0;
c = foreach b generate $0 * $1;

will run through the map stage and then produce an error in the reduce stage. Ideally the concept of multiplying a bag by a scalar would either be well defined or produce an error before the map reduce tasks are begun.

Pig will change to clearly define which operators work with which types. Using an operator with an unsupported type will result in an error at compile time rather than during run time.

Expression operators are currently forced to use the lowest common denominator for certain calculations. For example, all numeric operations are done as doubles since the system cannot discriminate between floating point types and integer types. This is inefficient. It also breaks the law of least astonishment, in that anyone who has significant exposure to computer languages expects integer arithmetic operations to result in integer values, not floating point values.

For each operator, a table is provided showing how the operator will interact with the various data types. The following terms are used in the tables: error indicates that using the operator with operands of those types is a compile time error; boolean or a data type name (e.g. long) indicates the type of the result; a parenthetical such as (bytearray cast as int) indicates the implicit cast that will be applied; not yet indicates an operation that may be supported eventually but will not be implemented at this time.

A note about the use of bytearray: in most cases bytearray types can be coerced into other types. This is to maintain backward compatibility with current pig, and to maintain easy use of the language without requiring casts everywhere in the code.

Comparators

Comparators are currently "perlish" in that they require the user to define the type of the operands. If the user wishes to compare two operands numerically, == is used, whereas eq is used for comparing two operands as strings. The existing string comparators eq, ne, lt, lte, gt, gte will continue to be supported, but they should be considered deprecated. It will now be legal to use ==, !=, etc. with data of type chararray.
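
For example, with these changes a filter such as the following would be legal on chararray data (relation and field names hypothetical):

b = FILTER a BY name != 'fred';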

Numeric Equals and Not Equals (==, !=)

Only the upper half of each symmetric operator table below is shown; a blank cell has the same value as the entry for the mirrored pair of types.

| ==, != | bag | tuple | map | int | long | float | double | chararray | bytearray |
| bag | error | error | error | error | error | error | error | error | error |
| tuple | | boolean (1) | error | error | error | error | error | error | error |
| map | | | boolean (2) | error | error | error | error | error | error |
| int | | | | boolean | boolean | boolean | boolean | error | boolean (bytearray cast as int) |
| long | | | | | boolean | boolean | boolean | error | boolean (bytearray cast as long) |
| float | | | | | | boolean | boolean | error | boolean (bytearray cast as float) |
| double | | | | | | | boolean | error | boolean (bytearray cast as double) |
| chararray | | | | | | | | boolean | boolean (bytearray cast as chararray) |
| bytearray | | | | | | | | | boolean |

(1) Tuple A is equal to tuple B iff they have the same size s, and for all 0 <= i < s, A[i] == B[i].
(2) Map A is equal to map B iff A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2.

String Equality and Inequality Comparators.

These include eq, ne, lt, lte, gt, and gte. These operators should be considered deprecated.

Numeric Inequality Operators (>, >=, <, <=)

| >, >=, <, <= | bag | tuple | map | int | long | float | double | chararray | bytearray |
| bag | error | error | error | error | error | error | error | error | error |
| tuple | | error | error | error | error | error | error | error | error |
| map | | | error | error | error | error | error | error | error |
| int | | | | boolean | boolean | boolean | boolean | error | boolean (bytearray cast as int) |
| long | | | | | boolean | boolean | boolean | error | boolean (bytearray cast as long) |
| float | | | | | | boolean | boolean | error | boolean (bytearray cast as float) |
| double | | | | | | | boolean | error | boolean (bytearray cast as double) |
| chararray | | | | | | | | boolean | boolean (bytearray cast as chararray) |
| bytearray | | | | | | | | | boolean |

MATCHES

| MATCHES | chararray | bytearray |
| chararray | boolean | boolean (bytearray cast as chararray) |
| bytearray | | boolean |

IS NULL

This is a new operator.

This operator will be added to check if a datum is null. It can be applied to any data type.
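
A minimal sketch of its use (relation and field names hypothetical):

missing = FILTER a BY name IS NULL;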

Binary Operators

Multiplication and Division

These are the operators * and / .

| *, / | bag | tuple | map | int | long | float | double | chararray | bytearray |
| bag | error | error | error | not yet | not yet | not yet | not yet | error | error |
| tuple | | error | error | not yet | not yet | not yet | not yet | error | error |
| map | | | error | error | error | error | error | error | error |
| int | | | | int | long | float | double | error | int (bytearray cast as int) |
| long | | | | | long | float | double | error | long (bytearray cast as long) |
| float | | | | | | float | double | error | float (bytearray cast as float) |
| double | | | | | | | double | error | double (bytearray cast as double) |
| chararray | | | | | | | | error | error |
| bytearray | | | | | | | | | double (bytearray cast as double) |

Modulo

This is a new operator, %. It is valid only for integer types.

| % | int | long | bytearray |
| int | int | long | int (bytearray cast as int) |
| long | | long | long (bytearray cast as long) |
| bytearray | | | error |

We may choose not to implement the mod operator right away, as there are no immediate user requests for it.

Addition and Subtraction

These are the operators + and - .

| +, - | bag | tuple | map | int | long | float | double | chararray | bytearray |
| bag | error | error | error | error | error | error | error | error | error |
| tuple | | not yet | error | error | error | error | error | error | error |
| map | | | error | error | error | error | error | error | error |
| int | | | | int | long | float | double | error | int (bytearray cast as int) |
| long | | | | | long | float | double | error | long (bytearray cast as long) |
| float | | | | | | float | double | error | float (bytearray cast as float) |
| double | | | | | | | double | error | double (bytearray cast as double) |
| chararray | | | | | | | | error | error |
| bytearray | | | | | | | | | double (bytearray cast as double) |

Concat

This is a new user defined function.

A new function concat will be added for chararrays.

| concat | chararray | bytearray |
| chararray | chararray | chararray (bytearray cast as chararray) |
| bytearray | | bytearray |
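
An illustrative use, assuming the function is invoked as CONCAT (relation and field names hypothetical):

full_names = FOREACH a GENERATE CONCAT(first, last);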

Unary Operators

Negation

This is the unary operator - .

| Type | Result |
| bag | error |
| tuple | error |
| map | error |
| int | int |
| long | long |
| float | float |
| double | double |
| chararray | error |
| bytearray | double (bytearray cast as double) |

Size

This is a new user defined function.

A new function size will be added that returns the size of a datum of any type.

| Type | Result |
| bag | returns number of elements in the bag |
| tuple | returns arity of the tuple |
| map | returns number of key/value pairs in the map |
| int | returns 1 |
| long | returns 1 |
| float | returns 1 |
| double | returns 1 |
| chararray | returns number of characters in the array |
| bytearray | returns number of bytes in the array |
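
An illustrative use, assuming the function is invoked as SIZE (relation and field names hypothetical):

name_lengths = FOREACH a GENERATE SIZE(name);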

Tuple Dereference Operator

This is the operator . .

For tuples, dereferencing into the tuple via . will continue to be supported. This dereferencing can be done via name or position. For example mytuple.myfield and mytuple.$0 are both valid dereferences. If the dot operator is applied to a bytearray, the bytearray will be assumed to be a tuple.

Map Dereference Operator

This is the operator # .

For maps, dereferencing into the map via # will continue to be supported. This dereferencing must be done by key, for example mymap#mykey. If the pound operator is applied to a bytearray, the bytearray will be assumed to be a map.

Cast Operators

Casts will be added to the language.

C/Java-like syntax will be used for casts (e.g. (int)mydouble).
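
A sketch of how casts might appear in a script (relation name and field positions are illustrative):

b = FOREACH a GENERATE (int)$0 + 1, (chararray)$1;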

Cast operators.

| from \ to | bag | tuple | map | int | long | float | double | chararray | bytearray |
| bag | | error | error | error | error | error | error | error | error |
| tuple | error | | error | error | error | error | error | error | error |
| map | error | error | | error | error | error | error | error | error |
| int | error | error | error | | yes | yes | yes | yes | error |
| long | error | error | error | yes | | yes | yes | yes | error |
| float | error | error | error | yes | yes | | yes | yes | error |
| double | error | error | error | yes | yes | yes | | yes | error |
| chararray | error | error | error | yes | yes | yes | yes | | error |
| bytearray | yes | yes | yes | yes | yes | yes | yes | yes | |

(Diagonal cells, casting a value to its own type, are left blank.)

Downcasts may cause loss of data (i.e. casting from long to int may drop bits).

Aggregate Functions

The aggregate functions take a bag of tuples and an indicator of which field in the tuple to compute on.

In the following chart, the entry error indicates that applying that aggregate function to a field of this type is an error. A data type (e.g. long) indicates that applying that aggregate function to a field of this type returns the indicated data type.

| Function | int | long | float | double | chararray | bytearray |
| COUNT | long | long | long | long | long | long |
| SUM | long | long | double | double | error | double (bytearray cast as double) |
| AVG | long | long | double | double | error | double (bytearray cast as double) |
| MIN | int | long | float | double | chararray | bytearray |
| MAX | int | long | float | double | chararray | bytearray |

Argument Construction for Functions

Because of the way pig breaks up grouping and cogrouping separate from the application of aggregation functions and flattening, people who are used to SQL struggle to understand how to use expression operators in an aggregate function. Consider the following SQL and pig queries:

-- SQL query 1
SELECT name, sum(salary * bonus_multiplier)
FROM employee
GROUP BY name;

# pig latin query 1
employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
grouped = GROUP employee BY name;
total_compensation = FOREACH grouped GENERATE group, SUM(salary * bonus_multiplier);

At first glance, it would appear that these two are functionally equivalent. But they are not. The schema of grouped in the pig query is {group, {[name, salary, bonus_multiplier]}} (where {} indicates a bag and [] a tuple). The FOREACH in the last line serializes the outer bag, so that each tuple in it is processed separately. The function SUM() is then passed two bags, one containing a tuple for each of the entries in grouped.employee:salary and one containing a tuple for each of the entries in grouped.employee:bonus_multiplier. Multiplying these two bags is not what the user intends in this case, and is not a well defined mathematical operation.

The proposed solution is to change the semantics of pig, so that expression evaluation on function arguments is done before the arguments are constructed as bags of tuples, rather than afterwards. This means that the semantics would change so that SUM(salary * bonus_multiplier) means that for each tuple in grouped, the fields grouped.employee:salary and grouped.employee:bonus_multiplier will be multiplied and the result formed into tuples that are placed in a bag to be passed to the function SUM(). Note that this will not resolve all problems, such as the following:

-- SQL query 2
SELECT name, sum(e.salary * b.multiplier)
FROM employee JOIN bonuses USING (name)
GROUP BY name;

This still could not be replicated in pig latin as:

# pig latin query 2
employee = LOAD 'employee' AS (name, salary);
bonuses = LOAD 'bonuses' AS (name, multiplier);
grouped = COGROUP employee BY name, bonuses BY name;
total_compensation = FOREACH grouped GENERATE group, SUM(employee:salary * bonuses:multiplier);

This would not work because grouped.employee:salary and grouped.bonuses:multiplier are in different bags. There is thus no way to intelligently construct the bag of tuples salary * multiplier. This would instead need to be recast as:

# pig latin query 2.1
employee = LOAD 'employee' AS (name, salary);
bonuses = LOAD 'bonuses' AS (name, multiplier);
grouped = COGROUP employee BY name, bonuses BY name;
flattened = FOREACH grouped GENERATE group, FLATTEN(employee), FLATTEN(bonuses);
grouped_again = GROUP flattened BY group;
total_compensation = FOREACH grouped_again GENERATE group, SUM(employee:salary * bonuses:multiplier);

This double grouping is unfortunate. But given that the user wishes to do arithmetic evaluation across bags, this seems unavoidable. The language will need to detect this and error out when a query such as query 2 is given.
