Introduction
A number of bugs have been filed against Pig that fall under the area of poorly defined or undefined semantics. In the 0.9 Pig release we would like to take on a number of these issues, clarifying semantics where they are unclear, defining them where they are undefined, and correcting them where they are clearly wrong. This page classifies the existing bugs and indicates what we believe the proper fix is for them.
Categories
The bugs have been placed into the following categories:
- Schema: Bugs related to schemas, such as schemas that are improperly inferred.
- Grammar: Places where the grammar is unclear or produces unexpected results.
- Nested Types: Issues dealing with bags, tuples, and maps.
- Dynamic Type Binding: In certain situations Pig assumes a value to be of type byte array when it does not know the actual type, and handles whatever actual type it is at runtime. There are situations where this does not work properly.
Bug Table
| JIRA | Category | Proposed Solution | Backward Compatible | Proposed Priority |
|  | Dynamic type binding | Close as won't fix | yes | high |
|  | Nested types | See Casting of Bytearrays below. | yes | high |
|  | Dynamic type binding | See Casting of Bytearrays below. | yes | high |
|  | Dynamic type binding | See Using Bytearray as a Shuffle Key below. | yes | high |
|  | Dynamic type binding | See Using Bytearray as a Shuffle Key below. | yes | high |
|  | Dynamic type binding | See Casting of Bytearrays below. | yes | high |
|  | Dynamic type binding | See Casting of Bytearrays below. | yes | high |
|  | Grammar | Cogroup inner does not match the semantics of inner join. We should deprecate it in this release, and then remove it in the next. | yes | low |
|  | Grammar | Put this off for now, until we can determine how far to extend it. This change should not be part of 0.9. | yes | low |
|  | Grammar | Support reassigning of aliases. | yes | low |
|  | Grammar | We continue not supporting this. But we should detect it at compile time rather than at runtime. | yes | medium |
|  | Nested types | See Casting Complex Types below. | yes | medium |
|  | Nested types | See Two Level Access below. | yes | high |
|  | Nested types | Fix DESCRIBE to properly show the tuple inside a bag. | no | medium |
|  | Nested types | Fix schema merge to act consistently for complex types. | no | medium |
|  | Nested types | Make sure that two level access is correctly set for bags generated by projection. | no | medium |
|  | Nested types | Fix schema merge to act consistently for complex types. | yes | medium |
|  | Nested types | Support use of positional references in bags and tuples when the bag is declared as bag{} or the tuple as tuple() | yes | medium |
|  | Schema | Flattening a bag with an unknown schema should produce a record with an unknown schema | no | high |
|  | Schema | Pick one semantic for schema merges and use it consistently throughout Pig | no | medium |
|  | Schema | See Schemas and Load below | no | high |
|  | Schema | See Schemas and Load below | no | high |
|  | Schema | See Schemas and Load below | no | high |
|  | Schema | See Schemas and Load below | no | high |
|  | Grammar | Disallow use of types in the AS clause in foreach OR allow cast semantics if the types are convertible | no | medium |
Discussion
Casting of Bytearrays
Currently, whenever Pig assumes that a value is a bytearray and that value needs to be cast, it will try to call typeToBytes on it. In cases where the underlying value is not actually a bytearray, this results in a null value. In some cases it also appears to result in errors. For example, consider:
A = load 'bla' as (m:map[]);
B = foreach A generate (int)m#'k';
If m contains Strings as values (which is not unreasonable), the above query will return null rather than casting the String to an Integer. In cases such as this, where Pig assumes the value is a bytearray and inserts it into an expression, it needs to determine the actual type at runtime and act on that. The easiest way to do this may be a new physical operator that wraps the bytearray field. The operator would know what type of result it should return (in this case, an integer). If the underlying type is truly bytearray, the operator would call byteToInt from the appropriate load function. If the underlying type is something else, it would apply the appropriate cast from that type.
Since this must be done at runtime and will involve testing on each record, it will not perform as well as the case where the value type is known. This is an acceptable tradeoff, as users can declare the type of their data to get better performance.
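The runtime check described above can be modeled in a few lines. The following is only a sketch in Python, not Pig code: `runtime_cast_to_int` and `bytes_to_int` are hypothetical names, and `bytes_to_int` merely stands in for whatever conversion the load function would supply. It shows the intended behavior: inspect the actual type of each value, use the loader's conversion only for true bytearrays, and fall back to null (with a warning) when the cast fails.

```python
import logging


def bytes_to_int(raw):
    """Hypothetical stand-in for the load function's bytes-to-integer conversion."""
    return int(raw.decode("utf-8"))


def runtime_cast_to_int(value):
    """Cast a value that was *assumed* to be a bytearray to an integer.

    The actual type is examined per record, which is why this is slower
    than a cast whose input type is known at compile time.
    """
    try:
        if value is None:
            return None
        if isinstance(value, (bytes, bytearray)):
            # Truly a bytearray: defer to the loader's conversion.
            return bytes_to_int(value)
        if isinstance(value, (int, float, str)):
            # Some other type: apply the ordinary cast for that type.
            return int(value)
    except (ValueError, OverflowError):
        # A cast that fails at runtime yields a warning and a null.
        logging.warning("could not cast %r to int", value)
        return None
    # Types with no cast to int (for example a map) also yield null.
    return None
```

With this model, a map value holding the string "42" casts to 42 instead of null, while a value such as "abc" still produces null.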
Using Bytearray as a Shuffle Key
In cases where the shuffle key (order by, group by, join key) is assumed to be a bytearray because Pig does not know the actual type, we need to wrap that bytearray in a tuple and tell Hadoop that our shuffle key is a tuple. Since a tuple can hold any object, the cases where the underlying object is not actually a bytearray will still be handled. Pig will have to remember to unwrap the key from the tuple on the reduce side.
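As a rough illustration of the wrap and unwrap steps, here is a small Python model of a shuffle. It is not Hadoop or Pig code; `wrap_key`, `unwrap_key`, and `shuffle` are hypothetical names, and the in-memory sort stands in for the real shuffle. The point is only that the key travels inside a one-field tuple and is taken back out before the reduce-side logic sees it.

```python
from itertools import groupby


def wrap_key(key):
    """Map side: put the key of unknown type inside a one-field tuple."""
    return (key,)


def unwrap_key(wrapped):
    """Reduce side: recover the original key from the tuple."""
    return wrapped[0]


def shuffle(records, key_of):
    """Group records by key, using the wrapped tuple as the shuffle key."""
    keyed = [(wrap_key(key_of(r)), r) for r in records]
    keyed.sort(key=lambda kv: kv[0])  # the "shuffle" compares tuples only
    return [
        (unwrap_key(k), [r for _, r in group])
        for k, group in groupby(keyed, key=lambda kv: kv[0])
    ]
```

The grouping code never looks at the type of the key itself, which is the property the tuple wrapper is meant to provide.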
Casting Complex Types
When the table defining valid casts between Pig Latin types was drawn up, casts between different definitions of a given complex type (bag to bag, tuple to tuple) were not defined. We need to define the semantics for those. I propose the following:
- It is only possible to cast within a type (that is, tuple to tuple, bag to bag, map to map).
- Casting to a type with a different number of fields is not a valid cast. For example, casting {(int, int)} to {(long)} is not valid.
- Casting to a type where every sub-cast between fields is a valid cast is valid. For example, casting {(int)} to {(long)} is valid, but casting {(int)} to {(bytearray)} is not.
- As is the case with other Pig Latin casts, an invalid cast that is caught at compile time will generate an error and one that is caught at runtime will generate a warning and a null value.
When we add maps with a specified type for the value field, these same rules should apply.
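The compile-time validity rules above can be expressed as a short recursive check. This is a sketch in Python under stated assumptions, not Pig's type checker: the schema encoding (`tup`, `bag`), the function names, and the small `WIDENING` table are all made up for illustration, and the scalar rules cover only a subset of the real cast table.

```python
# ASSUMPTION: an illustrative subset of valid scalar casts, not Pig's full table.
WIDENING = {
    ("int", "long"), ("int", "float"), ("int", "double"),
    ("long", "float"), ("long", "double"), ("float", "double"),
}


def tup(*fields):
    """Hypothetical encoding of a tuple schema, e.g. tup("int", "long")."""
    return ("tuple", list(fields))


def bag(*fields):
    """Hypothetical encoding of a bag schema; a bag holds exactly one tuple."""
    return ("bag", [tup(*fields)])


def scalar_cast_ok(src, dst):
    if src == dst:
        return True
    if dst == "bytearray":
        return False          # nothing is cast *to* bytearray
    if src == "bytearray":
        return True           # bytearray may be cast to any scalar
    return (src, dst) in WIDENING


def is_valid_cast(src, dst):
    """Apply the proposed rules for casts between complex types."""
    if isinstance(src, str) and isinstance(dst, str):
        return scalar_cast_ok(src, dst)
    if isinstance(src, str) or isinstance(dst, str):
        return False          # simplification: no scalar <-> complex casts here
    (src_kind, src_fields), (dst_kind, dst_fields) = src, dst
    if src_kind != dst_kind:
        return False          # only tuple to tuple, bag to bag
    if len(src_fields) != len(dst_fields):
        return False          # field counts must match
    return all(is_valid_cast(s, d) for s, d in zip(src_fields, dst_fields))
```

For instance, `is_valid_cast(bag("int"), bag("long"))` is true, while `is_valid_cast(bag("int", "int"), bag("long"))` is false because the field counts differ.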
Schemas and Load
Four of the schema bugs center around the same question: how to handle the case when data does not match the schema provided by the user in either LOAD or FOREACH. The current situation is that if users provide a schema to LOAD that includes types, then a FOREACH is inserted after the LOAD. If the actual data has more fields than the schema specifies, the extra fields are truncated. If the data has fewer fields than the schema specifies, nulls are appended. If, however, the user specifies a schema without types, the data is not modified (neither appended to nor truncated). As can be seen from the bugs above, the situation is even worse with flattening, FOREACH, and AS.
I propose the following change.
A = LOAD 'x' AS (a, b, c);
will be semantically equivalent to:
A' = LOAD 'x';
A = FOREACH A' GENERATE $0 as a, $1 as b, $2 as c;
Similarly,
B = FOREACH A GENERATE x, flatten(y) AS (alpha, beta), z;
will be semantically equivalent to:
B' = FOREACH A GENERATE x, flatten(y), z;
B = FOREACH B' GENERATE x, $1 AS alpha, $2 AS beta, z;
Both of the above will hold whether types are declared in the schema or not. In the case where types are not declared, care must be taken not to cast the field to a bytearray. Pig will assume it is a bytearray, as it does whenever it does not know the type. But, as is always the case, the underlying type may be something else.
This means that in all cases where AS is used, if the data has fewer columns than specified, it will be padded with nulls. If it has more, the extra columns will be truncated. This has the nice property that it is easy to understand (for both Pig developers and users) and it has a well-defined result in every case. It also resolves the question of what it means to use AS with LOAD when the loader provides the schema. That is, this FOREACH would be applied after the data is loaded, possibly changing the schema and the data.
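The padding and truncation rule is simple enough to state as a function. The sketch below is a Python model of the proposed semantics, not Pig code; `fit_to_schema` is a hypothetical name. It shows that a record is forced to the declared width and that values are passed through untouched when no types are declared.

```python
def fit_to_schema(record, width):
    """Force a record to the number of fields declared in the AS clause.

    Extra fields are dropped and missing fields become nulls (None).
    Values are not cast here, mirroring the untyped AS case.
    """
    values = list(record[:width])                  # truncate extras
    values.extend([None] * (width - len(values)))  # pad with nulls
    return tuple(values)
```

Under this model, a two-field record loaded with AS (a, b, c) becomes (x, y, null), and a four-field record loses its last field.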
Note that this change is not backward compatible. Also note that it will add a (hopefully small) performance penalty to defining schema in your script rather than in your loader. This is acceptable, as we see more and more loaders defining the schema in production cases.
Two Level Access
Ideally I would like to remove two level access. However, it is deeply entwined in our code and most likely in much user code. Removing it would be too disruptive. I propose instead that we make it automatically detected by the system, so that users no longer need to set it when creating bags in their loaders or EvalFuncs. Two level access can always be determined by the system by examining whether the data type is a bag and its only field is a tuple. By having Schema determine this automatically at construction time, we will remove the need for users to set it properly. We will not modify the interfaces that allow users to set it. At some point in the future, when we are sure we are always setting it correctly, we can quit honoring those interfaces (that is, they can still exist, but Pig will set two level access regardless of what the user indicates).
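The detection rule can be sketched as follows. This is a Python model, not Pig's Schema class: `FieldSchema`, `dtype`, and `two_level_access_required` are hypothetical names chosen for the example. It shows the flag being computed at construction time from the shape of the schema alone, so the caller never sets it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldSchema:
    """Hypothetical, simplified schema node used only for this sketch."""
    alias: Optional[str]
    dtype: str                                   # e.g. "bag", "tuple", "int"
    fields: List["FieldSchema"] = field(default_factory=list)
    # Not a constructor argument: derived in __post_init__.
    two_level_access_required: bool = field(init=False, default=False)

    def __post_init__(self):
        # Two level access applies when this is a bag whose only field is a tuple.
        self.two_level_access_required = (
            self.dtype == "bag"
            and len(self.fields) == 1
            and self.fields[0].dtype == "tuple"
        )
```

A bag built with a single tuple field reports the flag as true, while a tuple, or a bag with no declared fields, reports it as false.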