Pig Types Design

Purpose

This document presents the design for adding types to pig. See the functional specification for complete details of what changes will be made.

Architectural Changes

Several significant architectural changes will be made as part of adding types to pig.

  1. The classes representing data types will be significantly changed.
  2. The way expression evaluation is done will be significantly changed.
  3. The schema class will be modified.

Data Type Classes

Currently there are four classes extending Datum: DataBag, DataMap, DataAtom, and Tuple.

In the new design, the class Datum will be removed. Tuple and DataBag will remain, but be converted to interfaces. Maps and all atomic types will be implemented by already existing java objects, according to the following table:

Pig data type

Implementing Class

Bag

org.apache.pig.data.DataBag

Tuple

org.apache.pig.data.Tuple

Map

java.util.Map<Object, Object>

Integer

java.lang.Integer

Long

java.lang.Long

Float

java.lang.Float

Double

java.lang.Double

Chararray

java.lang.String

Bytearray

byte[]

Tuple and DataBag will be converted to interfaces to allow alternate implementations by others who wish to use their own form of tuples or bags. Default implementations (DefaultTuple and DefaultAbstractBag, DefaultDataBag, SortedDataBag, and DistinctDataBag) will be provided. As part of this BagFactory will be rewritten to be abstract. A DefaultBagFactory will be provided that will implement the functionality currently in BagFactory. An abstract factory will be added for tuples, TupleFactory. A DefaultTupleFactory will be provided. All Tuple construction will be changed to be done via the factory. Each of the factories will have a getFactory() method that will determine which actual factory to instantiate. This determination will be done by checking configuration parameters. This can be used by those who wish to use their own tuples or bags to override the default tuple or bag implementations.

The class of static values DataType will be added to provide values for each data type. static const final byte values are used rather than an enum to avoid the object overhead.

The class of static functions DataReadWriteHelper will be added to provide for (de)serialization of standard types. This will allow users who implement their own version of Tuple or Bag to still use standard methods to (de)serialize the contents. Methods are not provided in this class for Tuple and DataBag because those implement !Writable and thus will already have read and write methods.

Java objects (rather than pig specific objects) were chosen to represent most types in order to take advantage of all the language features of each type and avoid the need to either reimplement common functionality (such as increment for integer) or force conversion to and from pig specific data types and java types. First class language objects will also be easier and more natural for UDF writers to work with.

Objects were chosen over base types (e.g. Integer over int) in order to allow the concept of null, and to ease the interface and implementation (ie a tuple can be implemented as an ArrayList of objects).

Diagram 1 presents the new types architecture in more detail:

Diagram 1: Data Type Classes newData.jpg

Expression Evaluation

Currently, expression evaluation is handled by two sets of classes, those extending Cond and those extending EvalFunc. Also, some operations needed in expression evaluation (binary condition, constants, column projection, map dereferencing) are extensions of EvalSpec. This will be reworked to create one set of classes that handle expression evalution.

Keeping with the changes proposed for splitting logical and physical plans, logical expressions and physical expressions will be created. The logical expressions will be used for type checking and determining the correct physical expressions to put in the execution plan. The physical expressions will be used for evaluation.

New classes LogicalExpr and PhysicalExpr will be created. Each type of expression will have one logical class that extends LogicalExpr and one or more physical classes that extend PhysicalExpr. Physical expressions may have more than one class implementing a given expression in order to handle type overloading of that expression. Where possible, this will be handled by generics, but where not possible it will be handled by multiple classes (since java generics do not allow the developer to provide a specific realization for a given type). For example, consider the physical expression plus. This can be done as a generic, because the java '+' operator that it will implement can be applied to Integer, Long, Float, or Double. But the physical expression equals cannot be done as a generic. This is because equals() cannot be applied to a java bytearray. So a separate class will be necessary to implement equals for bytearray types. Separate classes will be created rather than putting functionality for all types in one class to avoid if/else chains in the expression evaluation.

These new classes will cause the following changes:

TODO: Should StarSpec be replaced too, or should it stay an EvalSpec?

Diagram 2: Extenders of EvalSpec in new architecture newEvalSpec.jpg

Diagram 3: LogicalExpr and its extenders newLogicalExpr.jpg

Diagram 4: PhysicalExpr and its extenders newPhysicalExpr.jpg

Schema

The concept of Schema will be changed. Currently, a Schema represents either the alias for an atomic field (AtomSchema) or the (possibly) set of aliases for a complex field (TupleSchema). This will be broken out into two classes, FieldSchema and Schema. FieldSchema will always represent one field, regardless of the data type of the field. It will contain information on alias name, data type, and (if the field is a tuple) the Schema. Schema will be only for tuples, and will allow the caller to fetch the FieldSchema for a field based either on position or alias.

Diagram 5: Schema class in the new architecture newSchema.jpg

Detailed Design

Parser Changes

The parser will be changed in several areas:

AS Syntax

AS statements will be modified to accept type specification as part of the schema specification. This will require changes in the grammar as well as the construction of the Schema object.

Type Checking

Type checking will be done by a LogicalExprVisitor on the expressions contained in each ExprSpec. This will be done after the parse phase has completed. It could be done during parsing. However, keeping this separate will make the code in the parser simpler. It will also allow multiple implementations of the parser (as we currently have for the pig pen code) without requiring multiple implementations of the type checker.

As part of type checking, type promotion will be done. For example, if the user requests the addition of an integer and a double, the type checker will handle promoting the integer to a double by inserting a cast operator between the integer and the plus expression. So what started out as: PLUS(3, 3.0) will become PLUS(castIntToDouble(3), 3.0). This is necessary even for cases where java would do the promotion for us (as in our example of double plus integer) because we do no want to assume the implementation of the backend. We may change that at some point in the future.

Physical Expression Selection

As noted above, for some expressions there will be multiple implementations of the expression, based on the type. This will be accomplished in some cases by generics, and in some cases by separate class implementations. The code that compiles the logical plan into a physical plan will need to be able to choose the correct physical implementation of a logical expression. This will be achieved by having tables defined for each logical operator. These tables will be a matrix with rows and columns for each type. The compiler code can then look up the appropriate operator in this matrix simply using the byte codes for the types. For example, the table for plus will look like:

plusOperator[][] = {
   // bag, tuple, map,   int,               long,           float,           double,           chararray, bytearray
   {null,  null,  null,  null,              null,           null,            null,             null,      null}, // bag
   {null,  null,  null,  null,              null,           null,            null,             null,      null}, // tuple
   {null,  null,  null,  null,              null,           null,            null,             null,      null}, // map
   {null,  null,  null,  ExprPlus<Integer>, null,           null,            null,             null,      null}, // int
   {null,  null,  null,  null,              ExprPlus<Long>, null,            null,             null,      null}, // long
   {null,  null,  null,  null,              null,           ExprPlus<Float>, null,             null,      null}, // float
   {null,  null,  null,  null,              null,           null,            ExprPlus<Double>, null,      null}, // double
   {null,  null,  null,  null,              null,           null,            null,             null,      null}, // chararray
   {null,  null,  null,  null,              null,           null,            null,             null,      null}  // bytearray
};

When the compiler finds a plus operator with two integers as arguments, it will look up plusOperatorSelection[DataType::TypeInteger][DataType::TypeInteger] and then instantiate the indicated class.

Recall that the type checker will be placing cast operators in to handle type promotion. This is why plus is not defined for double plus integer, etc.

If the compiler looks up an entry in the table and the result is null, it will give an error informing the user that an operation of the requested type is not supported.

Constants

The parser will change to support non-string constants. For a specification of constant syntax see the functional specification. The parser will support these constants, and encode them as ConstExpr objects.

Casts

The parser will change to support cast syntax. For the syntax see the functional specification. Cast operators will be selected via tables in the same way as other operators.

DEFINE Syntax

To support changes to define syntax, a symbol table will be added to the parser. This table will track functions that have been defined with type signatures. Any function used in the query will be checked against this table. If the function used is not in the symbol table than the type checker will assign a bytearray type to the function result. If the function is present in the symbol table it will be checked that the argument usage matches what was specified in the symbol table and that the arguments are of the type specified in the symbol table. If either of these criteria are not met the parser will throw an error.

Changes to the Physical Plan and Expression Evaluation

As described in #Expression_Evaluation above, a new class ExprSpec will be added to handle all expression evaluation. In its constructor it will visit its associated expression tree to discover every expression in the tree that needs to get data from its input tuple. It will construct this into a map of offsets and expressions.

On each call to DataCollector::add it will then use this map to populate the appropriate expressions with the values from the current tuple. It will then call eval() on the expression at the top of the tree, and return the resulting value as its output.

Changing Current Operators

Currently the regular expressions in the Matches operator are compiled for every row. This will be changed so that they are compiled once in the constructor of the operator and just matched on each row. In addition, regular expressions will need to be supported on byte[] data where no coding is specified.

TODO: Need to investigate third party regular expression libraries to see what we can find that will handle regular expressions on byte sequences. Failing this, we may be able to take a third party library and modify it to work with byte arrays. We really don't want to write our own regular expressions.

Null Handling

Changes will be made to each expression operator to handle null values.

Changes to Arithmetic Expressions as Function Arguments

Expression tree construction in the parser will be changed to match the new semantics specified here.

Feedback from Design Review Meeting

PigTypesDesign (last edited 2011-03-26 21:24:49 by jeremyhanna)