Backward incompatible changes in Pig 0.7.0

Pig 0.7.0 will include some major changes to Pig, most of them driven by the load/store redesign. Some of these changes are not backward compatible and will require users to change their Pig scripts or their UDFs. This document keeps track of such changes so that we can document them for the release.

Summary

Change | Section | Impact | Steps to address | Comments
Load/Store interface changes | Changes to the Load and Store Functions | High | See LoadStoreMigrationGuide and Pig070LoadStoreHowTo |
Data compression becomes load/store function specific | Handling Compressed Data | Unknown but hopefully low | If compression is needed, the underlying Input/OutputFormat needs to support it |
Switching to Hadoop's local mode | Local Mode | Low | None | Main change is a 10-20x performance slowdown. Also, local mode now uses the same UDF interfaces as MR mode to execute UDFs
Removing support for the Load-Stream and Stream-Store optimization | Streaming | Low to None | None | This feature was never documented, so it is unlikely it was ever used
Serialization and deserialization via load/store functions are no longer supported | Streaming | Unknown but hopefully low to medium | Implement the new PigToStream and StreamToPig interfaces for non-standard serialization | See LoadStoreRedesignProposal
Removing the BinaryStorage builtin | Streaming | Low to None | None | As far as we know, this class was only used internally by streaming
Output part files now have "-m-" or "-r-" in the name | Output file names | Low to medium | If you have a system that depends on output file names, note that they have changed from part-XXXXX to part-m-XXXXX (written from the map phase) or part-r-XXXXX (written from the reduce phase) |
Removing the split-by-file feature | Split by File | Low to None | The InputFormat of the loader needs to support this | We don't know that this feature was widely (or ever) used
Local files no longer accessible from the cluster | Access to Local Files from Map-Reduce Mode | Low to None | Copy the file to the cluster using the copyFromLocal command prior to the load | This feature was not documented
Removing custom comparators | Removing Custom Comparators | Low to None | None | This feature has been deprecated since the Pig 0.5.0 release. We don't have a single known use case
Merge join pre-condition change | Merge Join | Unknown | The loader for the right input should implement OrderedLoadFunc instead of SamplableLoader |
Using PigFileInputFormat and PigTextInputFormat | PigFileInputFormat and PigTextInputFormat | Low to medium | Custom loaders that use a text-based input format and want to support recursive file listing need to use these classes | This works around MAPREDUCE-1577

Changes to the Load and Store Functions

See the LoadStoreMigrationGuide and Pig070LoadStoreHowTo pages.

Handling Compressed Data

In 0.6.0 and earlier versions, Pig supported bzip-compressed files with the extensions .bz or .bz2, as well as gzip-compressed files with the .gz extension. Pig was able to both read and write these formats, with the understanding that gzip-compressed files could not be split across multiple maps while bzip-compressed files could. Data compression was also completely decoupled from the data format and from the load/store functions: any loader could read compressed data, and any store function could write it, simply by virtue of the right extension on the files being read or written.

With Pig 0.7.0, the read/write functionality is taken over by Hadoop's Input/OutputFormat classes, and how compression is handled, or whether it is handled at all, depends on the Input/OutputFormat used by the load/store function.

The main input format that supports compression is TextInputFormat. PigStorage is the only loader shipped with Pig that is derived from TextInputFormat, which means it will be able to handle .bz2 and .gz files. Other loaders, such as BinStorage, will no longer support compression.

On the store side, TextOutputFormat also supports compression, but the store function needs to do additional work to enable it. Again, PigStorage will support compression while other functions will not.

If you have a custom load/store function that needs to support compression, you would need to make sure that the underlying Input/OutputFormat supports this type of compression.
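
As an illustration, here is a minimal sketch of where that hook lives in the Pig 0.7 API. PigStorage already handles .gz and .bz2 itself, and GzipStorage below is a hypothetical class; the point is only that compression is enabled on the Job through the underlying output format:

    import java.io.IOException;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.pig.builtin.PigStorage;

    // Hypothetical store function: gzip the output when the location ends in .gz.
    public class GzipStorage extends PigStorage {
        @Override
        public void setStoreLocation(String location, Job job) throws IOException {
            super.setStoreLocation(location, job);  // sets the output path
            if (location.endsWith(".gz")) {
                // Ask the underlying output format to compress everything it writes.
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            }
        }
    }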

Local Mode

The main change here is that we switched from Pig's native local mode to Hadoop's local mode. This change should be transparent to most applications. Possible differences you will see are:

  1. Hadoop's local mode is about an order of magnitude slower than Pig's local mode, something the Hadoop team has promised to address.
  2. For algebraic functions, the entire Algebraic interface is now used, which is likely a good thing if you use local mode to test your production applications; see the sketch after this list.
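
For reference, here is a minimal sketch of what implementing the Algebraic interface involves, using a COUNT-style UDF (MyCount is hypothetical; Algebraic and EvalFunc are Pig's interfaces):

    import java.io.IOException;
    import org.apache.pig.Algebraic;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical COUNT-style UDF: in Pig 0.7's local mode all three stages
    // (Initial, Intermed, Final) are now exercised, just as in MR mode.
    public class MyCount extends EvalFunc<Long> implements Algebraic {
        private static final TupleFactory tf = TupleFactory.getInstance();

        // Non-algebraic fallback: count the tuples in the input bag.
        @Override
        public Long exec(Tuple input) throws IOException {
            return ((DataBag) input.get(0)).size();
        }

        public String getInitial()  { return Initial.class.getName(); }
        public String getIntermed() { return Intermed.class.getName(); }
        public String getFinal()    { return Final.class.getName(); }

        // Map side: emit the size of each incoming bag as a partial count.
        public static class Initial extends EvalFunc<Tuple> {
            @Override
            public Tuple exec(Tuple input) throws IOException {
                return tf.newTuple(((DataBag) input.get(0)).size());
            }
        }

        // Combiner: sum partial counts, re-wrapped in a tuple.
        public static class Intermed extends EvalFunc<Tuple> {
            @Override
            public Tuple exec(Tuple input) throws IOException {
                return tf.newTuple(sum(input));
            }
        }

        // Reduce side: sum partial counts into the final result.
        public static class Final extends EvalFunc<Long> {
            @Override
            public Long exec(Tuple input) throws IOException {
                return sum(input);
            }
        }

        private static long sum(Tuple input) throws IOException {
            long count = 0;
            for (Tuple t : (DataBag) input.get(0)) {
                count += ((Number) t.get(0)).longValue();
            }
            return count;
        }
    }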

Streaming

There are two things that are changing in streaming.

First, in the initial (0.7.0) release, we will not support the optimization where, if streaming follows a load of a compatible format or is followed by a store of a compatible format, the data is not parsed but passed in chunks from the loader or to the store. The main reason we are not porting this optimization is that the work is not trivial, and the optimization was never documented and so is unlikely to have been used.

Second, you can no longer use load/store functions for (de)serialization. A new pair of interfaces has been defined that must be implemented for custom (de)serialization. The default (PigStorage) format will continue to work. This format is now implemented by a class called org.apache.pig.builtin.PigStreaming, which can also be used directly in the streaming statement. Note that this class handles arbitrary delimiters. For example, your statement could look like:

 `perl StreamScript.pl` input(stdin using PigStreaming(',')) output(stdout using PigStreaming(';')) <...remaining options...>;

Details of the new interfaces are described in http://wiki.apache.org/pig/LoadStoreRedesignProposal.
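
As a sketch of what such an implementation might look like (TabStreaming is hypothetical and its tab-delimited logic is illustrative; the interface shapes follow the proposal above):

    import java.io.IOException;
    import org.apache.pig.LoadCaster;
    import org.apache.pig.PigToStream;
    import org.apache.pig.StreamToPig;
    import org.apache.pig.builtin.Utf8StorageConverter;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Illustrative (de)serializer exchanging tab-delimited UTF-8 lines with
    // the streaming executable.
    public class TabStreaming implements PigToStream, StreamToPig {

        // Tuple -> bytes written to the executable's stdin.
        @Override
        public byte[] serialize(Tuple t) throws IOException {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < t.size(); i++) {
                if (i > 0) sb.append('\t');
                sb.append(t.get(i));
            }
            sb.append('\n');
            return sb.toString().getBytes("UTF-8");
        }

        // Bytes read from the executable's stdout -> tuple of chararrays.
        @Override
        public Tuple deserialize(byte[] bytes) throws IOException {
            String[] fields = new String(bytes, "UTF-8").split("\t", -1);
            Tuple t = TupleFactory.getInstance().newTuple(fields.length);
            for (int i = 0; i < fields.length; i++) {
                t.set(i, fields[i]);
            }
            return t;
        }

        // Caster applied when the script declares types for streamed fields.
        @Override
        public LoadCaster getLoadCaster() throws IOException {
            return new Utf8StorageConverter();
        }
    }

Such a class could then be named in the streaming statement in the same way as PigStreaming above.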

We have also removed the org.apache.pig.builtin.BinaryStorage load/store function and org.apache.pig.builtin.PigDump, which were only used from within streaming. They can be restored if needed; we would just need to implement the corresponding Input/OutputFormats.

Split by File

In earlier versions of Pig, a user could specify "split by file" on the load statement, which made sure that each map got an entire file rather than having files further divided into blocks. This feature was primarily designed for the streaming optimization but could also be used with loaders that can't deal with incomplete records. We don't believe this functionality has been widely used.

Because the slicing of the data is no longer under Pig's control, we can't support this feature generically for every loader. If a particular loader needs this functionality, it will need to make sure that the underlying InputFormat supports it. (Any InputFormat based on FileInputFormat supports this through the mapred.min.split.size property: if this property is set to a value greater than the size of any of the files to be loaded, then each file will be loaded as a single split. The property can be provided on the pig command line as a java -D property; note that it will then apply to all jobs run as part of that script.)
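
For example, assuming the pig wrapper script passes -D definitions through to the JVM (directly or via PIG_OPTS), an invocation along these lines would make every file smaller than 1GB load as a single split:

    pig -Dmapred.min.split.size=1073741824 myscript.pig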

If the streaming optimization proves necessary, we will take a different approach to it.

Access to Local Files from Map-Reduce Mode

In earlier versions of Pig, you could access a local file from map-reduce mode by prepending file: to the file location:

A = load 'file:/mydir/myfile';
...

When Pig processed this statement, it would first copy the data to DFS and then import it into the execution pipeline.

In Pig 0.7.0, you can no longer do this. If this functionality is still desired, you can add the copy to your script manually:

fs -copyFromLocal src dist
A = load 'dist';
....

Removing Custom Comparators

This functionality was added to deal with gaps in Pig's early functionality: the lack of numeric comparison in order by and the lack of descending sort. Both have been supported for the last four releases, and custom comparators have been deprecated for the last several releases. The functionality is removed in this release.

Merge Join

In Pig 0.6.0 there was a pre-condition for merge join: "The loadfunc for the right input of the join should implement the SamplableLoader interface." In Pig 0.7.0, the LoadFunc should instead implement the OrderedLoadFunc interface. All other pre-conditions still hold.
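
In code, the new pre-condition looks roughly like the sketch below (MyLoader is hypothetical, and ordering splits by their byte offset is just one plausible choice). If your loader is based on FileInputFormat, extending org.apache.pig.FileInputLoadFunc should provide this implementation for you:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.OrderedLoadFunc;

    // Hypothetical loader advertising merge-join eligibility in Pig 0.7.
    public abstract class MyLoader extends LoadFunc implements OrderedLoadFunc {

        // Return a comparable value reflecting the relative order of splits;
        // here, the starting byte offset of the file split.
        @Override
        public WritableComparable<?> getSplitComparable(InputSplit split)
                throws IOException {
            return new LongWritable(((FileSplit) split).getStart());
        }
    }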

PigFileInputFormat and PigTextInputFormat

Given a load location, Pig 0.6.0 loaders recursively load all files under that location (which can be multiple levels deep). To get around MAPREDUCE-1577, Pig 0.7.0 adds the PigFileInputFormat and PigTextInputFormat classes. They are subclasses of Hadoop's FileInputFormat and TextInputFormat that override the listStatus method to support multi-level/recursive directory and file listing. Any custom loader that uses FileInputFormat or TextInputFormat and wants to support recursive file listing should use the corresponding Pig version of the InputFormat, as in the sketch below.
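
For example, a custom text-based loader would plug the Pig version in through getInputFormat (MyTextLoader is hypothetical; only this method matters here):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;

    // Hypothetical loader that opts into Pig's recursive-listing input format.
    public abstract class MyTextLoader extends LoadFunc {
        @Override
        public InputFormat getInputFormat() throws IOException {
            // PigTextInputFormat overrides listStatus to list files recursively,
            // working around MAPREDUCE-1577.
            return new PigTextInputFormat();
        }
    }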
