Introduction

A number of bugs have been filed against Pig that fall under the area of poorly defined or undefined semantics. In the 0.9 Pig release we would like to take on a number of these issues, clarifying semantics where they are unclear, defining them where they are undefined, and correcting them where they are clearly wrong. This page classifies the existing bugs and indicates what we believe the proper fix is for them.

Categories

The bugs have been placed into the following categories: dynamic type binding, grammar, nested types, and schema.

Bug Table

JIRA     | Category             | Proposed Solution                                                                          | Backward Compatible | Proposed Priority
---------|----------------------|--------------------------------------------------------------------------------------------|---------------------|------------------
PIG-1341 | Dynamic type binding | Close as won't fix.                                                                        | yes                 | high
PIG-1277 | Nested types         | See Casting of Bytearrays below.                                                           | yes                 | high
PIG-1222 | Dynamic type binding | See Casting of Bytearrays below.                                                           | yes                 | high
PIG-1065 | Dynamic type binding | See Using Bytearray as a Shuffle Key below.                                                | yes                 | high
PIG-999  | Dynamic type binding | See Using Bytearray as a Shuffle Key below.                                                | yes                 | high
PIG-696  | Dynamic type binding | See Casting of Bytearrays below.                                                           | yes                 | high
PIG-621  | Dynamic type binding | See Casting of Bytearrays below.                                                           | yes                 | high
PIG-1584 | Grammar              | Cogroup inner does not match the semantics of inner join; deprecate it in this release and remove it in the next. | yes | low
PIG-678  | Grammar              | Put off for now, until we can determine how far to extend it; not part of 0.9.             | yes                 | low
PIG-438  | Grammar              | Support reassigning of aliases.                                                            | yes                 | low
PIG-313  | Grammar              | Continue not supporting this, but detect it at compile time rather than at runtime.        | yes                 | medium
PIG-1371 | Nested types         | See Casting Complex Types below.                                                           | yes                 | medium
PIG-847  | Nested types         | See Two Level Access below.                                                                | yes                 | high
PIG-767  | Nested types         | Fix DESCRIBE to properly show the tuple inside a bag.                                      | no                  | medium
PIG-730  | Nested types         | Fix schema merge to act consistently for complex types.                                    | no                  | medium
PIG-723  | Nested types         | Make sure two level access is correctly set for bags generated by projection.              | no                  | medium
PIG-694  | Nested types         | Fix schema merge to act consistently for complex types.                                    | yes                 | medium
PIG-496  | Nested types         | Support positional references in bags and tuples declared as bag{} or tuple().             | yes                 | medium
PIG-1627 | Schema               | Flattening a bag with an unknown schema should produce a record with an unknown schema.    | no                  | high
PIG-1536 | Schema               | Pick one semantic for schema merges and use it consistently throughout Pig.                | no                  | medium
PIG-1188 | Schema               | See Schemas and Load below.                                                                | no                  | high
PIG-1112 | Schema               | See Schemas and Load below.                                                                | no                  | high
PIG-749  | Schema               | See Schemas and Load below.                                                                | no                  | high
PIG-435  | Schema               | See Schemas and Load below.                                                                | no                  | high
PIG-1718 | Grammar              | Disallow types in the AS clause of FOREACH, or apply cast semantics when the types are convertible. | no | medium

Discussion

Casting of Bytearrays

Currently, whenever Pig assumes that a value is a bytearray and that value needs to be cast, it calls the load function's bytesTo<Type> conversion on it. In cases where the underlying value is not actually a bytearray, this results in a null value; in some cases it appears to also result in errors. For example, consider:

    A = load 'bla' as (m:map[]);
    B = foreach A generate (int)m#'k';

If m contains Strings as values (which is not unreasonable), the above query will return null rather than casting the String to an Integer. In cases such as this, where Pig assumes the value is a bytearray and inserts it into an expression, it needs to determine the actual type at runtime and act on that. The easiest way to do this may be a new physical operator that wraps the bytearray field. It would know what type of result it should return (in this case, an integer). If the underlying type is truly bytearray, this operator would call bytesToInteger from the appropriate load function; if it is another type, it would insert the appropriate cast.

Since this must be done at runtime and will involve testing on each record, it will not perform as well as the case where the value type is known. This is an acceptable tradeoff, as users can declare the type of their data to get better performance.
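The dispatch the proposed operator would perform can be sketched as follows. This is an illustrative Python sketch, not Pig's actual Java operator; the names cast_to_int and bytes_to_int are hypothetical, with bytes_to_int standing in for the load function's bytesToInteger converter:

```python
def cast_to_int(value, bytes_to_int):
    """Cast a field Pig assumed to be a bytearray to an integer at runtime."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        # The value really is a bytearray: delegate to the load function.
        return bytes_to_int(bytes(value))
    try:
        # The underlying type is something else (e.g. a String):
        # apply the ordinary cast for that type.
        return int(value)
    except (ValueError, TypeError):
        # Per Pig semantics, an invalid runtime cast yields null.
        return None
```

The per-record type test is exactly the cost discussed above: it runs once for every value whose type was not declared.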

Using Bytearray as a Shuffle Key

In cases where the shuffle key (the order by, group by, or join key) is assumed to be a bytearray because Pig does not know the actual type, we need to wrap that bytearray in a tuple and tell Hadoop that our shuffle key is a tuple. Since a tuple's fields can hold any Object, the cases where the underlying object is not actually a bytearray will still be handled. Pig will have to remember to unwrap the key from the tuple on the reduce side.
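A minimal sketch of the wrapping, in Python rather than Pig's Java internals (the function names are hypothetical):

```python
def wrap_shuffle_key(key):
    """Map side: wrap the key in a single-field tuple.

    The declared shuffle key type is then always "tuple", regardless of
    whether the runtime value is a bytearray, a String, or anything else.
    """
    return (key,)

def unwrap_shuffle_key(wrapped):
    """Reduce side: recover the original key from the wrapper tuple."""
    (key,) = wrapped
    return key
```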

Casting Complex Types

When the table defining valid casts between Pig Latin types was drawn up, casts between different definitions of a given complex type (bag to bag, tuple to tuple) were not defined. We need to define the semantics for those. I propose the following:

  1. It is only possible to cast within a type (that is, tuple to tuple, bag to bag, map to map).
  2. Casting to a type with a different number of fields is not a valid cast. For example, casting {(int, int)} to {(long)} is not valid.
  3. Casting to a type where every sub-cast between fields is a valid cast is itself valid. For example, casting {(int)} to {(long)} is valid, but casting {(int)} to {(bytearray)} is not.
  4. As is the case with other Pig Latin casts, an invalid cast caught at compile time will generate an error, and one caught at runtime will generate a warning and a null value.

When we add maps with a specified type for the value field, these same rules should apply.
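The rules above could be checked as in the following sketch. The schema encoding is hypothetical (a primitive is a string such as 'int'; ('tuple', fields), ('bag', fields), and ('map', value_type) describe complex types), and the set of valid primitive casts is abbreviated:

```python
# Abbreviated set of valid primitive widenings. Note that bytearray is
# absent as a target, so {(int)} -> {(bytearray)} is rejected by rule 3.
PRIMITIVE_CASTS = {
    ('int', 'long'), ('int', 'float'), ('int', 'double'),
    ('long', 'float'), ('long', 'double'), ('float', 'double'),
}

def valid_cast(src, dst):
    if isinstance(src, str) and isinstance(dst, str):
        return src == dst or (src, dst) in PRIMITIVE_CASTS
    if isinstance(src, tuple) and isinstance(dst, tuple) and src[0] == dst[0]:
        if src[0] in ('tuple', 'bag'):            # rule 1: same kind of type only
            if len(src[1]) != len(dst[1]):        # rule 2: same number of fields
                return False
            return all(valid_cast(s, d)           # rule 3: every sub-cast valid
                       for s, d in zip(src[1], dst[1]))
        if src[0] == 'map':                       # maps with a typed value field
            return valid_cast(src[1], dst[1])
    return False                                  # cross-kind casts are invalid
```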

Schemas and Load

Four of the schema bugs center on the same question: how to handle the case when data does not match the schema provided by the user, either in LOAD or in FOREACH. Currently, if users provide a schema to LOAD that includes types, a FOREACH is inserted after the LOAD. If the actual data exceeds the number of fields specified in the schema, the extra fields are truncated; if the data has fewer fields than specified, nulls are appended. If, however, the user specifies a schema without types, the data is not modified (neither padded nor truncated). As can be seen from the bugs above, the situation is even worse with flattening, FOREACH, and AS.

I propose the following change.

    A = LOAD 'x' AS (a, b, c);

will be semantically equivalent to:

    A' = LOAD 'x';
    A = FOREACH A' GENERATE $0 AS a, $1 AS b, $2 AS c;

Similarly

    B = FOREACH A GENERATE x, flatten(y) AS (alpha, beta), z;

will be semantically equivalent to:

    B' = FOREACH A GENERATE x, flatten(y), z;
    B = FOREACH B' GENERATE x, $1 AS alpha, $2 AS beta, z;

Both of the above will hold whether or not types are declared in the schema. Where types are not declared, care must be taken not to cast the field to bytearray: Pig will assume it is a bytearray, as it does whenever it does not know the type, but, as is always the case, the underlying type may be something else.

This means that in all cases where AS is used, if the data has fewer columns than specified it will be padded with nulls, and if it has more, the extras will be truncated. This has the nice property of being easy to understand (for both Pig developers and users) and of having a well-defined result in every case. It also resolves the question of what it means to use AS with LOAD when the loader provides the schema: the FOREACH is applied after the data is loaded, possibly changing both the schema and the data.
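The pad-or-truncate behavior is simple enough to state as code. A sketch, assuming fields is the list of values in a record and arity is the number of fields named in the AS clause (None stands in for Pig's null):

```python
def apply_as(fields, arity):
    """Force a record to the arity declared in an AS clause.

    Extra fields are truncated; missing fields are padded with nulls.
    """
    out = list(fields[:arity])
    out.extend([None] * (arity - len(out)))
    return out
```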

Note that this change is not backward compatible. Also note that it will add a (hopefully small) performance penalty to declaring the schema in the script rather than in the loader. This is acceptable, as we see more and more loaders defining the schema in production cases.

Two Level Access

Ideally I would like to remove two level access. However, it is deeply entwined in our code, and most likely in much user code, so removing it would be too disruptive. I propose instead that the system detect it automatically, so that users no longer need to set it when creating bags in their loaders or EvalFuncs. Two level access can always be determined by examining whether the data type is a bag whose only field is a tuple. By having Schema determine this automatically at construction time, we remove the need for users to set it properly. We will not modify the interfaces that allow users to set it; at some point in the future, when we are sure we are always setting it correctly, we can stop honoring those interfaces (that is, they can still exist, but Pig will set two level access regardless of what the user indicates).
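The detection itself is mechanical. A Python sketch with a hypothetical schema encoding (each field is a dict with a 'type' and, for complex types, a 'fields' list); Pig's actual Schema class is Java, so this only illustrates the rule:

```python
def needs_two_level_access(field):
    """A bag needs two level access exactly when its only field is a tuple."""
    return (field.get('type') == 'bag'
            and len(field.get('fields', [])) == 1
            and field['fields'][0].get('type') == 'tuple')
```

A Schema constructor applying this check at construction time would make the user-supplied flag redundant, which is what lets us eventually stop honoring it.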