Pig Road Map

The following document was developed as a roadmap for pig at Yahoo prior to pig being released as open source.

Abstract

This document lays out the road map for pig as of the end of Q3 2007. It begins by laying out the vision for pig and describing its current state. It categorizes different features to be worked on and discusses the priorities of those categories. Each feature is described in detail. Finally, it discusses how individual features will be prioritized.

Pig Vision

The pig query language provides several significant advantages compared to straight map reduce:

  1. There are common data query operations that most, if not all, users querying data make use of. These include project, filter, aggregate, join, and sort (all of the common database relational operators). Pig provides these in the map reduce framework so that users need not implement these common operators over and over, while still allowing the user to include their custom code for non-common operations.
  2. The use of a declarative language opens up grid data queries to users who do not have the expertise, time, or inclination to write java programs to answer their queries.
  3. A great majority of data queries can be answered using the provided operators. A declarative language greatly reduces the development effort needed by users to make use of grid data.
  4. An interactive shell allows users to easily experiment with their data and queries in order to shorten their development cycle.

All of this comes at a price in performance and flexibility. An expert programmer will always be able to write more efficient code in a lower level language to execute a given job than can be done in a general purpose higher level language. Also, some types of jobs may not fit nicely into pig's query model. For these reasons it is not envisioned that pig should become the only way to access data on the grid. However, given sufficient performance, stability, and ease of use, the majority of users will find pig an easier way to write, test, and maintain their code.

Given this vision, the goals for the pig team over the next twelve months are:

  1. Make pig a stable, usable, production quality product.
  2. Make pig user friendly in terms of understanding the data being queried, how queries are executed, etc.
  3. Provide performance at or near the same level available from implementing similar operations directly in map reduce.
  4. Provide flexibility for users to integrate their code, in any of several supported common languages, into their pig queries.
  5. Provide users a quality shell in which to navigate the HDFS and do query development.
  6. Provide tools for grid administrators to understand how their data is being queried and created via pig.

Current Status

The current status of pig, measured against the six goals above, is:

  1. Release 1.2 of pig focused mainly on goal 1 of making pig stable and production quality. There is still work to be done in this, particularly in the areas of error handling and documentation.
  2. Very little support is given to users at this time to assist them in understanding the structure of the data they are querying, how their queries will be executed, etc.
  3. Pig performance against similar code written directly in map reduce has not been tested.
  4. Users can integrate java code with pig. Other languages are not supported. The supported java code must be a function that accepts a row at a time and outputs one or more rows in response. Other models, such as streaming, are not supported.
  5. The grunt shell provides the ability to browse the HDFS and to create pig queries. It is very rudimentary, lacking many features of a modern shell.
  6. There are no tools currently to allow a grid administrator to monitor how pig is being used on the grid.

Development Categories

Features under consideration for development are categorized as:

Category                    Priority        Types of Features / Comments
--------                    --------        ----------------------------
Engineering manageability   High            release engineering, testing
Infrastructure              High            architectural changes, such as metadata, types
Performance                 High / Medium   changes that bring pig into near parity with direct map reduce are high priority, others are medium
Production quality          Medium          good error handling, reliability
User experience             Medium          ease of use, providing adequate information for the user
Security                    Low             do not have requirements or security support from HDFS yet

Feature Details

For each of the features, the following information is given:

Explanation: What this feature is.

Rationale: Why it is needed. This should include a reference to which of the pig goals mentioned above this meets.

Category: Which of the above categories the feature fits into.

Requestor: Indicates whether the feature was requested by a user, development, QA, administrator (that is, grid administrators), or management.

Depends On: Other features that must be implemented before this one can be. Most of these are internal, but a few are external dependencies.

Current Situation: The feature may be totally non-existent, already partially addressed in current pig, under design, or under development.

Estimated Development Effort:

Urgency:

Describe Schema

Explanation: Users need to be able to see the schema of data they wish to query. This includes finding the schema of data that is not yet being queried (e.g. describe '/user/me/mydata') and determining the schema of a relation after several pig operations, e.g.:

a = load('/user/me/mydata') as (query, urls, user_clicked);
b = filter a by user_clicked >= '1';
c = group b by query;
describe c;

Also, this should allow the user to see the schema that will result from merging several files, e.g.

a = load('/user/me/mydata/*')
describe a;

should return the merged schema of all the files in the directory /user/me/mydata.
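
The schema merge described above could be sketched as follows; the widen-to-bytearray rule for conflicting types, and the field and type names, are assumptions for illustration, not pig's actual behavior:

```python
def merge_schemas(schemas):
    """Merge per-file schemas into one relation schema: take the union
    of field names in first-seen order; if the same field appears with
    different types, widen it to an uninterpreted 'bytearray'."""
    merged, order = {}, []
    for schema in schemas:
        for name, ftype in schema:
            if name not in merged:
                merged[name] = ftype
                order.append(name)
            elif merged[name] != ftype:
                merged[name] = "bytearray"  # conflicting types: widen
    return [(name, merged[name]) for name in order]
```

For example, merging a file whose clicks field is an int with one where it is a long would report clicks as bytearray under this (deliberately crude) rule.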

Rationale: Goal 2.

Category: User experience.

Requestor: Users, Development.

Depends On: Metadata

Current Situation: As of pig 1.2 the second type of describe (schema of a current relation) can be done. Describing the schema of a file cannot be done until metadata about the file exists.

Estimated Development Effort: Small.

Urgency: Medium.

Documentation

Explanation: Pig users need comprehensive user documentation that will help them write queries, write user defined functions, submit their pig scripts to the cluster, etc.

Rationale: Goal 1.

Requestor: Users, Development.

Category: User experience.

Depends On:

Current Situation: There are a few existing wiki pages at http://wiki.apache.org/pig/. However, there is not enough at this time to allow users to get started with pig, and what exists is not collected into a coherent whole with an index, etc.

Estimated Development Effort: Medium. This effort requires expertise not currently existing in the Pig development team. A document writer is needed to assist the team in this.

Urgency: High.

Early Error Detection and Failure

Explanation: Errors should be detected in pig queries at the earliest possible opportunity. This includes syntax parsing that should be done before pig connects to HOD at all, and type checking that should be done before the jobs are submitted to Map Reduce.
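
The type-checking half of this could work roughly as below; the expression representation and type names are hypothetical, chosen only to show a semantic error being caught before anything is submitted:

```python
def check_types(expr, schema):
    """Return the result type of expr, raising TypeError for
    semantically meaningless operations (e.g. arithmetic on a string
    field) before any job is submitted.  expr is a nested tuple:
    ('field', name) or (op, left, right)."""
    numeric = ("int", "long", "float", "double")
    if expr[0] == "field":
        return schema[expr[1]]
    if expr[0] in ("add", "sub", "mul", "div"):
        left = check_types(expr[1], schema)
        right = check_types(expr[2], schema)
        for t in (left, right):
            if t not in numeric:
                raise TypeError("cannot apply %s to %s" % (expr[0], t))
        return "double" if "double" in (left, right) else left
    raise ValueError("unknown operator: %s" % expr[0])
```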

Rationale: Goal 1.

Category: User experience.

Requestor: Development

Depends On: Some types of checking will require Metadata, others could be accomplished without it.

Current Situation: Currently a number of errors are not detected until runtime.

Estimated Development Effort: Small.

Urgency: Medium.

Error Handling

Explanation: Pig needs to divide errors into categories of:

Rationale: Goal 1.

Category: Production quality.

Requestor: Development.

Depends On: NULL Support

Current Situation: Most errors encountered during map reduce execution cause the system to immediately abandon the query (e.g. a divide by zero error in a query).

Estimated Development Effort: Medium.

Urgency: Medium.

Explain Plan Support

Explanation: Users want to be able to see how their query will be executed. This helps them understand how changes in their queries will affect the execution.

Rationale: Goal 2.

Category: User experience.

Requestor: Users

Depends On:

Current Situation: No explanation is currently available for query execution. The user can read the output of pig to see how many times map reduce was invoked, but even this does not make clear which operations are being done in each step.

Estimated Development Effort: Small.

Urgency: Low.

Removing Automatic String Encoding

Explanation: Currently all data that pig reads from the grid is assumed to be strings encoded in UTF8. However, some of the data stored in the grid is not UTF8 encoded, and the values are too short to use inductive algorithms to determine the encoding. Pig (and any other user) therefore has no way to determine the encoding. However, since pig currently assumes the data is UTF8 encoded, it corrupts the data by converting it improperly.

There are two partial solutions to this. One is to provide users with a C-string-like data type that provides some basic operators (equality, less than, maybe very limited regular expressions) and does not attempt any interpretation or translation of the underlying data. This will be done as part of the types work to create a byte data type. The second solution is, even for full string types, to only translate the underlying data to java strings when it is actually required, instead of by default. This should be done for performance reasons, as translation is expensive.
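
The second (lazy translation) solution might be sketched like this; the class and method names are invented for illustration and are not pig's implementation:

```python
class LazyField:
    """Hold the raw bytes exactly as read from disk; compare without
    interpreting them, and decode to a string only when a string
    operation actually requires it."""
    def __init__(self, raw):
        self.raw = raw          # uninterpreted bytes
        self._decoded = None    # cache for the expensive decode
    def __eq__(self, other):
        return self.raw == other.raw   # byte-wise equality
    def __lt__(self, other):
        return self.raw < other.raw    # byte-wise ordering
    def as_string(self):
        if self._decoded is None:
            self._decoded = self.raw.decode("utf-8")  # only on demand
        return self._decoded
```

Equality and ordering (and hence sorting and grouping) never touch the encoding; only an explicit string operation pays for, and risks, the UTF8 conversion.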

Rationale: Goal 1.

Category: Production quality.

Requestor: Development

Depends On: Types Beyond String

Current Situation: See Explanation.

Estimated Development Effort: Small, as most of the work will be done in "Types Beyond String".

Urgency: Medium.

Language Expansion

Explanation: Users have requested a number of extensions to the language. Some of these are new functionality, some just extensions of existing functionality. Requested extensions are:

  1. CASE statement. This would generalize the currently available binary condition operator (? :).

  2. Extend operators allowed inside FOREACH { ... GENERATE }. Users have requested that GROUP, COGROUP, and JOIN be added.

  3. Allow user to specify that a sort should be done in descending order instead of the default ascending order. This should be extended to the SQL functionality of specifying ascending or descending for each field, e.g. ORDER a BY $0 ASCENDING, $1 DESCENDING.

  4. Add LIMIT functionality.
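
Taking item 1 as an example: since CASE generalizes the binary conditional, a front end could desugar it into nested (cond ? val : rest) expressions rather than adding a new runtime operator. A sketch over a hypothetical tuple-based AST:

```python
def desugar_case(branches, default):
    """Rewrite CASE WHEN c1 THEN v1 ... ELSE default END into nested
    binary conditionals, built from the last branch outward."""
    expr = default
    for cond, val in reversed(branches):
        expr = ("cond", cond, val, expr)
    return expr
```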

Rationale: Goal 1.

Category: User experience.

Requestor: Users

Depends On:

Current Situation: None of the above requested extensions currently exist.

Estimated Development Effort: Medium for all of the above, but note that any one of the above could be added without the others. Taken alone, each of the above is Small.

Urgency: Low.

Logging

Explanation: Pig needs a consistent, configurable way to write query progress, debugging, and error information both to the user and to logs.

Rationale: Goal 1.

Category: User experience and Engineering manageability.

Requestor: Development.

Depends On:

Current Situation: Most pig progress, debugging, and error information is currently written to stderr via System.err.println. Hadoop uses log4j to provide trace logging. Recently (as of pig 1.1d), pig began to use log4j as well. However, very few parts of the code have been converted from println to log4j. In addition, pig only has a screen appender for log4j, so it only writes to stderr. Pig also needs to generate a log file on the front end. See [LoggingPolicy] for more details.

Estimated Development Effort: Medium.

Urgency: Medium.

Metadata

Explanation: Pig needs metadata to do the following:

  1. Provide a way for users to understand the data they want to query. For example, users should be able to say DESCRIBE '/user/me/mydata' and get back a list of (minimally) fields and their types.

  2. Do type checking to find situations where users issue queries that are not semantically meaningful (such as dividing by a string).
  3. Allow performance optimizations such as:
    • performing arithmetic operations (e.g. SUM) as a long if the underlying type is an int or long, instead of as a double.
    • allowing pig to load and store numeric types as numerics rather than requiring a conversion from string.
  4. Provide a way to describe to a user defined function the format of the data being passed to the function.
  5. Allow sorting via numeric in addition to lexical order.

The decision has been made to use the Jute interface, provided by hadoop, to describe metadata. This will allow pig to interact with other hadoop tools. It will also free the pig team from needing to develop their own metadata management library.

The goal is not to require metadata in pig. Pig will retain the flexibility to work on unspecified input data.

It remains an open question on how pig should handle the case where the data it is provided does not match the metadata specification. The default should be to produce a warning for each non-conforming row. It remains to be determined if there is a use for a "strict" mode where a non-conforming row would cause a query stopping failure.
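
The warn-versus-strict behavior described above might look as sketched below; the schema representation, type names, and casting rule are assumptions:

```python
def validate_rows(rows, schema, strict=False):
    """Yield rows cast to the types in schema (a list of (name, type)
    pairs); warn on and skip non-conforming rows by default, or stop
    the query on the first one when strict=True."""
    casters = {"int": int, "long": int, "double": float, "chararray": str}
    for lineno, row in enumerate(rows, 1):
        try:
            if len(row) != len(schema):
                raise ValueError("expected %d fields" % len(schema))
            yield tuple(casters[t](v) for (_, t), v in zip(schema, row))
        except (ValueError, KeyError):
            if strict:
                raise        # strict mode: a bad row fails the query
            print("warning: skipping non-conforming row %d" % lineno)
```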

Note that this metadata only refers to metadata local to a file. It does not refer to global metadata such as table names, available UDFs, etc.

Rationale: Goals 2, 3, 4.

Category: Infrastructure.

Requestor: Everyone.

Depends On: Types Beyond String, some changes to Jute.

Current Situation: No metadata is available concerning data stored in the grid.

Estimated Development Effort: Large.

Urgency: High.

NULL Support

Explanation: Pig needs to be able to support NULLs in its data for the following reasons:

  1. Some of its input data has NULL values in it, and users want to be able to act on this NULL data (e.g. filter it out via IS NOT NULL).
  2. Function and expression evaluations sometimes are unable to return a value, but execution of the query should not stop (e.g. divide by zero error). In this case the evaluation needs to return a NULL value to place in the field.

This requires that Jute support NULL values in data stored in its format.

This will require additions to the language, namely the ability to filter on IS (NOT) NULL. It will also require that user defined functions and expression evaluators determine how they will interact with NULLs. Functions that are similar to SQL functions (such as COUNT, SUM, etc.) and arithmetic expression operators should behave in a way consistent with SQL standards (even though SQL standards themselves are inconsistent) in order to avoid violating the law of least astonishment.
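
With Python's None standing in for NULL, the SQL-consistent behavior intended here looks like:

```python
def sql_sum(values):
    """SUM per SQL convention: NULLs are ignored; an input with no
    non-NULL values yields NULL rather than 0."""
    vals = [v for v in values if v is not None]
    return sum(vals) if vals else None

def sql_count(values):
    """COUNT(col) per SQL convention: count only non-NULL values."""
    return sum(1 for v in values if v is not None)

def safe_div(a, b):
    """Division that yields NULL instead of stopping the query, per
    the divide-by-zero example above."""
    if a is None or b is None or b == 0:
        return None
    return a / b
```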

Rationale: Goal 1.

Category: Infrastructure.

Requestor: Development.

Depends On: Jute NULL support.

Current Situation: There is no concept of NULL in pig at this time.

Estimated Development Effort: Medium.

Urgency: High.

Parameterized Queries

Explanation: Users would like the ability to define parameters in a pig query. When the query is invoked, values for those parameters would be defined and they would then be used in execution of the query. For example:

a = load '/data/mydata/@date@';
b = load '@latest_bot_filter@';
...

The query above could then be invoked, providing values for 'date' and 'latest_bot_filter'.
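
The substitution step itself is simple; the @name@ delimiters follow the example above, while failing on a missing parameter (rather than leaving the placeholder in place) is an assumption:

```python
import re

def substitute(query, params):
    """Replace every @name@ placeholder in query with the value
    supplied in params; a placeholder with no value is an error."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("no value supplied for parameter " + name)
        return params[name]
    return re.sub(r"@(\w+)@", repl, query)
```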

Rationale: Goal 1.

Category: User experience.

Requestor: Users

Depends On:

Current Situation: No support for this is available.

Estimated Development Effort: Small.

Urgency: Low.

Performance

Explanation: Pig needs to run, as close as is feasible for a general purpose language, at the same performance that could be obtained by a programmer developing an application directly in map reduce. There are multiple possible routes to take in performance enhancement:

  1. Query optimization via relational operator reordering and substitution. For example, pushing filters as far up the execution plan as possible, pushing expression evaluation as late as possible, etc.
  2. In some instances it is possible to execute multiple relational operators in the same map or reduce. In some cases pig is already doing this (e.g. multiple filters are pipelined into one map operation). There are additional cases that pig is not taking advantage of that it could (for example if a join is followed immediately by an aggregation on the same key, both could be placed in the same reduce).
  3. After the map stage, the map reduce system sorts the data according to keys specified by the job requester. This key comparison is done via a class specified by the job requester. Currently, this class is instantiated on every key comparison, which means that the constructor for this class is called at least n log n times. Hadoop provides a byte comparator option where the sequence of bytes in the key is compared directly rather than through an instantiated class. Wherever possible pig needs to make use of this comparator.
  4. Hadoop provides the ability to run a mini-reduce stage (called the combine stage) after the map and before data is shuffled to new processes on the reduce. For certain algebraic functions (such as SUM and COUNT) the amount of data to be shipped between map and reduce could be reduced (sometimes greatly) by running the reduce step first in the combine, and then again in reduce.
  5. The code as it exists today can be instrumented and tests done to determine where the most time is spent. This information can then be used to optimize heavily used areas of the code that are taking significant amounts of time.
  6. Split performance is very poor, and needs to be optimized.
  7. Currently, given a line of pig like foreach A generate group, COUNT($0), SUM($1), pig will run over A twice, once for COUNT and once for SUM. It should be able to compute both in a single pass.
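
Items 4 and 7 can be illustrated together: COUNT and SUM are algebraic, so a combine stage can collapse each map's output to a single (count, sum) pair, and both aggregates fall out of one pass over the data. This is a conceptual sketch, not hadoop's actual combiner API:

```python
def partial_count_sum(values):
    """Combine stage: reduce one map's output for a key to a single
    (count, sum) pair, so two numbers cross the network per map
    instead of every row."""
    count, total = 0, 0
    for v in values:        # one pass computes both aggregates
        count += 1
        total += v
    return (count, total)

def final_count_sum(partials):
    """Reduce stage: merge the per-map partials into final answers.
    This works because COUNT and SUM merge by simple addition."""
    return (sum(c for c, _ in partials), sum(s for _, s in partials))
```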

Rationale: Goal 3.

Category: Performance.

Requestor: Everyone.

Depends On: Metadata for some optimizations

Current Situation: See Explanation.

Estimated Development Effort:

  1. Large
  2. Medium
  3. Small
  4. Small
  5. Small
  6. ?
  7. ?

Urgency: High for 4, 5, and 6, Medium for the rest.

Query Optimization

See Performance

Shell Support

Explanation: There are three main goals of a pig shell:

  1. The ability to interact in a shell with the HDFS.
  2. The ability to interact in a shell with the user's machine. This would allow the user to edit local files, etc.
  3. The ability to write pig scripts in an interactive shell (similar to perl or python shells).

Items 1 and 2 are similar, the only difference being which file system and OS they interact with. Item 3 is something entirely different, though both types of items are called shells. Item 3 is an interactive programming environment. For this reason I do not think it makes sense to attempt to combine the two shells. Instead I propose that two shells be developed:

hsh (hadoop shell) - This shell will provide shell level interaction with HDFS. In addition to the current operators of cat, cp, kill, ls, mkdir, mv, pwd, and rm, it may be necessary to add chmod (based on hadoop implementation of permissions), chown and chgrp (based on hadoop implementation of ownership), df, ds, ln (if hadoop someday supports links), and touch.

This shell will also support access to the user's local machine and all of the available shell commands. Ideally the shell would even support applying standard shell file tools that access files linearly (grep, more, etc.) to HDFS files, but this may be a later addition.

Separate commands would not be required for cp, rm, etc. based on which file system; the shell should be able to determine that based on the path. To accomplish this, HDFS files would have /hdfs/CLUSTER (where CLUSTER is the name of the cluster, e.g. kryptonite) as the base of their path. hsh would then need to intercept shell commands such as rm and determine whether to execute them in HDFS or the local file system, based on the file being manipulated. For example, if a user wanted to create a file in his local directory on the gateway machine and then copy it to the HDFS, he could do:

hsh> vi myfile
hsh> cp myfile /hdfs/kryptonite/user/me/myfile

The invocation of vi would call vi on the local machine and edit a file in the current working directory on his local box. The copy would move the file from the user's local directory on the gateway machine to his directory on the HDFS.
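
The dispatch decision hsh must make for each command could be sketched as below; the function name and return shape are invented for illustration:

```python
def dispatch(path):
    """Classify a path as HDFS or local using the /hdfs/CLUSTER/...
    convention described above, returning a tuple of
    (filesystem, cluster, path within that filesystem)."""
    parts = path.split("/")
    if len(parts) > 2 and parts[1] == "hdfs":
        return ("hdfs", parts[2], "/" + "/".join(parts[3:]))
    return ("local", None, path)
```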

hsh will support standard command line editing (and ideally tab completion). This should be done via a third party library such as jline.

grunt - This shell will provide interactive pig scripting, similar to the way it does today. It will no longer support HDFS file manipulation commands (such as mv, rm, etc.). It will also be extended to support inline scripting in either perl or python. (Which one should be used is not clear. The research team votes for python. User input is needed to decide if more users would prefer python or perl.) This will enable two important features:

  1. Embedding pig in a scripting language to give it control flow and branching structures.
  2. Allow on the fly development of user defined functions by creating a function native to the scripting language and then referencing it in the pig script.

The grunt shell will be accessible from hsh by typing grunt. This will put the user in the interactive shell, in much the same way that typing python on a unix box puts the user in an interactive python development shell.

Rationale: Goals 4 and 5.

Category: Infrastructure.

Requestor: Development, Users

Depends On:

Current Situation: Currently grunt provides very limited commands (ls, mkdir, mv, cp, cat, rm, kill) for the HDFS. It also provides an interactive shell for generating pig scripts on the fly (similar to perl or python interactive shells). No command line history or editing is provided. No connection with the user's local file system is supported.

Estimated Development Effort: Large.

Urgency: Medium.

SQL Support

Explanation: There is a large base of SQL users. Many of these users will prefer to query data in SQL rather than learn pig. To accommodate these users and increase the adoption of hadoop, we could provide a SQL layer on top of pig. From pig's perspective, the easiest way to address this is to directly translate SQL to pig, and then execute the resulting pig script.

Rationale: Goal 4.

Category: User experience.

Requestor: Users

Depends On: Metadata

Current Situation: See Explanation.

Estimated Development Effort: Large.

Urgency: Medium.

Statistics on Pig Usage

Explanation: Pig should record statistics on its usage so that pig developers and grid administrators can monitor and understand pig usage. These statistics need to be collected in a way that does not compromise the security of data that is being queried (i.e. they cannot store results of the queries). They should however contain:

For security reasons, these statistics will need to be kept inside the grid they are generated on, and only be accessible to cluster administrators and pig developers.

Rationale: Goal 6. This also assists developers in meeting goals 1 and 3 because it allows them to determine the quality and performance of the user experience.

Category: Engineering manageability.

Requestor: Administrators

Depends On:

Current Situation: There are no usage statistics available.

Estimated Development Effort: Medium.

Urgency: Medium.

Stream Support

Explanation: Hadoop supports running HDFS files through an arbitrary executable, such as grep or a user provided program. This is referred to as streaming. There already exists a base of user programs used in this way. Users have expressed an interest in integrating these with pig so that they can take advantage of pig's parallel structure and relational operators but still use their hand-crafted programs to express their business logic. For example (note: syntax and semantics are still TBD):

a = load '/user/me/mydata';
b = filter a by $0 matches "^[a-zA-Z]\.yahoo.com";
c = stream b through countProperties;
...
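
Conceptually, streaming serializes each row as a line of text, pipes the lines through the external command, and reads the command's output back as rows. A sketch of that round trip, not pig's planned implementation (tab-separated fields are an assumption):

```python
import subprocess

def stream_through(rows, command):
    """Pipe rows (tuples of strings) through an external command, one
    tab-separated line per row, and parse its output back into rows."""
    text = "\n".join("\t".join(row) for row in rows) + "\n"
    proc = subprocess.run(command, input=text, capture_output=True,
                          text=True, check=True)
    return [line.split("\t") for line in proc.stdout.splitlines()]
```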

Rationale: Goal 4.

Category: Infrastructure.

Requestor: Users

Depends On:

Current Situation: No support for streaming is currently available.

Estimated Development Effort: Medium.

Urgency: Medium (while it has only been requested by a couple of users to date, we believe it will open up pig usage to a number of users and therefore is more desirable).

Test Framework

Explanation: Data querying systems like pig have unique functional testing requirements. For unit tests, junit is an excellent tool. But for function tests, as arbitrary queries over potentially large amounts of data need to be added on a regular basis, it is not feasible to use a testing system like junit that assumes the test implementer knows the correct result for the test when the test is implemented. We need a tool that will allow the tester to designate a source of truth for the test, and then generate the expected results for that test from that source of truth. For example, for small functional tests a database could be set up with the same data as in the grid. The tester would then write a pig query and an equivalent SQL query, and the test harness would run both and compare the results.
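
The comparison at the heart of such a harness reduces to running both queries and comparing unordered result sets; the interface below is hypothetical:

```python
def check_query(run_pig, run_truth, pig_query, truth_query):
    """Run a pig query and its equivalent against the designated
    source of truth (e.g. a SQL database loaded with the same data),
    then compare results as unordered multisets, since neither system
    guarantees row order without an explicit sort."""
    return sorted(run_pig(pig_query)) == sorted(run_truth(truth_query))
```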

In addition to the above, a set of performance tests need to be created to allow developers to test pig performance. It should be possible to develop these in the same framework as is used for the functional tests.

Rationale: Goal 1.

Category: Engineering manageability.

Requestor: Development.

Depends On:

Current Situation: Pig has unit tests in junit.

Estimated Development Effort: Medium.

Urgency: High (quality testing that is easy for developers to implement facilitates better and faster development of features).

Test Integration with Hudson

Explanation: Currently hadoop uses hudson as its nightly build environment. This includes checking out, building, and running unit tests. It is also used to test new patches that are submitted to jira. Pig needs to be integrated with hudson in this same way.

Also, there are a set of tools that are regularly applied to other hadoop nightly builds (Findbugs and code coverage metrics) via hudson. These need to be applied to pig as well.

Rationale: Goal 1.

Category: Engineering manageability.

Requestor: QA.

Depends On:

Current Situation: See Explanation

Estimated Development Effort: Medium.

Urgency: High (frees committers from manually testing patches).

Types Beyond String

Explanation: For performance and usability reasons pig needs to support atomic data types beyond strings. Based on data types supported in jute and standard data types supported by most data processing systems, these types will be:

Supporting these types in a native format will allow performance gains in reading and writing data to disk and in key comparison for sorting, grouping, etc. It will also avoid requiring string->number conversions for every row during numeric expression evaluation. It will also allow sorting in numeric order rather than requiring that all sorts be in lexical order, as is currently the case.
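
The sorting point is easy to demonstrate: when every value is a string, ORDER BY can only compare lexically, which misorders numbers:

```python
def sort_both_ways(values):
    """Sort string-typed values lexically (today's only option) and
    numerically (what native numeric types would allow)."""
    return sorted(values), sorted(values, key=int)
```

For example, sort_both_ways(["10", "9", "100"]) gives ["10", "100", "9"] lexically but ["9", "10", "100"] numerically.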

User defined types will require the user to provide a set of functions to:

Optionally, a user could provide functions that do

Rationale: Goals 1 and 3.

Category: Infrastructure.

Requestor: Development.

Depends On:

Current Situation: A functional spec for changes to types has been proposed, see http://wiki.apache.org/pig/PigTypesFunctionalSpec

Estimated Development Effort: Large.

Urgency: High

User Defined Function Support in Other Languages

Explanation: Many potential pig users are not java programmers. Many prefer to program in C++, perl, python, etc. Pig needs to allow users to program in their language of preference.

This requires that pig provide APIs so that users can write code in these other languages and use it as part of their query in a way similar to what is done with java functions today. Java functions can be implemented as eval functions, filter functions, or load and storage functions. At this point there is not a perceived need for load and storage functions in languages beyond java. But eval and filter functions should be supported.

Perl, python, and C++ have been chosen as the languages to be supported because those are the languages most used by the user community at this time.

Rationale: Goal 4.

Category: Infrastructure.

Requestor: Everyone.

Depends On:

Current Situation: Pig currently supports user defined functions in java.

Estimated Development Effort: Medium.

Urgency: High.



Addenda

The following thoughts were added after the roadmap had been published and reviewed.

Execution that supports restart

We discussed the need for something that takes a Pig execution plan and executes it in a way the user can easily observe and comprehend. If a job breaks, the user gets an HDFS directory with all of the completed sub-job state and enough information that the user can repair his Pig script or the data and restart. This is needed for really long jobs.
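
Such restartable execution might be sketched as below, with completion markers in a state directory standing in for the saved sub-job state (the names and the marker-file convention are assumptions):

```python
import os

def run_plan(sub_jobs, state_dir, run_job):
    """Execute named sub-jobs in order, skipping any recorded as
    complete by a previous run; on failure, the markers for finished
    sub-jobs survive, so a restart resumes where the plan broke."""
    for name in sub_jobs:
        marker = os.path.join(state_dir, name + ".done")
        if os.path.exists(marker):
            continue                  # finished before the failure
        run_job(name)                 # may raise; earlier markers kept
        open(marker, "w").close()     # record completion
```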

Global Metadata

Need to consider where/how/if pig will store global metadata. Items that would go here would include table->file mapping (to support such things as show tables), UDFs available on the cluster, UDTs available on the cluster, etc.

ProposedRoadMap (last edited 2009-09-23 16:01:11 by AlanGates)