Parameter Substitution in Pig

Motivation

This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time.

Requirements

  1. Ability to have parameters within a pig script and provide values for these parameters at run time.
  2. Ability to provide parameter values on the command line.
  3. Ability to provide parameter values in a file.
  4. Ability to generate parameter values at run time by running a binary or a script.
  5. Ability to provide default values for parameters
  6. Ability to retain the script with all parameters resolved. This is mostly for debugging purposes.

Interface

Using Parameters

Parameters in a pig script are in the form of $<identifier>.

A = load '/data/mydata/$date';
B = filter A by $0>'5';
.....

In this example, the value of the date is expected to be passed on each invocation of the script and is substituted before running the pig script. An error is generated if the value for any parameter is not found.

A parameter name has the structure of a standard language identifier: it must start with a letter or underscore, followed by any number of letters, digits, and underscores. Names are case insensitive. A parameter reference can be escaped with \, in which case substitution does not take place.
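The naming and escaping rules above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the class name ParamSubstitutionSketch and the method substitute are hypothetical, the parameter map is assumed to hold lower-cased names (to get case-insensitive lookup), and an escaped reference is assumed to lose its backslash in the output, which the proposal does not specify.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Pig.
final class ParamSubstitutionSketch {
    // Optional backslash, then '$', then an identifier: a letter or underscore
    // followed by any number of letters, digits, and underscores.
    private static final Pattern PARAM =
            Pattern.compile("(\\\\?)\\$([A-Za-z_][A-Za-z0-9_]*)");

    // Assumption: keys in params are stored lower-cased.
    static String substitute(String line, Map<String, String> params) {
        Matcher m = PARAM.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            String name = m.group(2);
            if (!m.group(1).isEmpty()) {
                // Escaped reference: no substitution. Assumption: the backslash is dropped.
                out.append('$').append(name);
            } else {
                String value = params.get(name.toLowerCase(Locale.ROOT));
                if (value == null) {
                    // A missing value is an error, as the proposal requires.
                    throw new IllegalArgumentException("Undefined parameter: " + name);
                }
                out.append(value);
            }
            last = m.end();
        }
        out.append(line, last, line.length());
        return out.toString();
    }
}
```

Because an identifier cannot start with a digit, positional references such as $0 in the filter statement are left untouched by this pattern.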

In the initial version of the software, parameters are only allowed when a pig script is specified. They are disabled with the -e switch and in interactive mode.

Specifying Parameters

Parameter values can be supplied in four different ways.

Command Line

Parameters can be passed on the pig command line using the -param <param>=<val> construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.

pig -param date=20080201

Parameter File

Parameters can also be specified in a file that is passed to pig using the -param_file <file> construct. Multiple files can be specified. If the same parameter is present multiple times in a file, the last value will be used and a warning will be generated. If a parameter is present in multiple files, the value from the last file will be used and a warning will be generated.

A parameter file contains one line per parameter. Empty lines are allowed. Perl-style (#) comment lines are also allowed. A comment must take a full line, and # must be the first character on the line. Each parameter line has the form <param_name>=<param_value>. Whitespace around = is allowed but optional.

# my parameters

date = 20080201
cmd = `generate_name`

Parameter files and command line parameters can be combined, with command line parameters taking precedence.
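As a rough illustration of the file format described above, the following Java sketch parses parameter lines into a map. The class ParamFileSketch and its parse method are hypothetical names; the sketch assumes names are lower-cased for case-insensitive lookup, keeps values in their raw form (quotes and backticks are not interpreted here), and prints the duplicate warning to stderr only to stay self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Pig.
final class ParamFileSketch {
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String raw : lines) {
            // Empty lines and full-line comments (# in the first column) are skipped.
            if (raw.trim().isEmpty() || raw.startsWith("#")) {
                continue;
            }
            int eq = raw.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException(
                        "Expected <param_name>=<param_value>: " + raw);
            }
            // Whitespace around '=' is optional, so both sides are trimmed.
            String name = raw.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = raw.substring(eq + 1).trim();
            // The last definition wins; a redefinition produces a warning.
            if (params.put(name, value) != null) {
                System.err.println("WARN: parameter " + name
                        + " redefined; using the last value");
            }
        }
        return params;
    }
}
```

Under these assumptions, command line values could be layered on top with a plain putAll on the resulting map, which gives them precedence over values read from files.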

Declare Statement

The declare command can be used from within a pig script. Its main use case is to define one parameter in terms of others.

%declare CMD `$mycmd $date`
A = load '/data/mydata/$CMD';
B = filter A by $0>'5';
.....

The format is %declare <param> <value>

The declare command starts with % to indicate that it is a preprocessor command, processed before the pig script is executed. It takes the highest precedence. The scope of a parameter value defined via declare is all the lines following the declare command, up to the next declare command that defines the same parameter.

Default Statement

The default command can be used to provide a default value for a parameter. This value is used if the parameter has no value defined by any other means. (default has the lowest priority.)

default has the same format and scoping rules as declare.

%default DATE '20080101'

Processing Order

  1. Parameter files are scanned in the order they are specified on the command line. Within each file, the parameters are processed in the order they are specified.
  2. Command line parameters are scanned in the order they are specified on the command line.
  3. declare/default statements are processed in the order they appear in the pig script.

Value Format

Value formats are identical regardless of how the parameter is specified, and a value can be of two types. The first is a sequence of characters enclosed in single or double quotes. In this case the unquoted version of the value is used during substitution. Quotes within the value can be escaped. Single-word values that don't use special characters such as % or = don't have to be quoted.

%declare DESC 'Joe\'s URL'
A = load 'data' as (name, desc, url);
B = FILTER A by desc eq '$DESC';

Note that the constant given to the filter needs to be enclosed in quotes because the parameter value is the unquoted version of the string.
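The unquoting rule for the first value type might look like the following Java sketch. The names ParamValueSketch and unquote are hypothetical, and the escape handling is an assumption: the proposal only says that quotes within the value can be escaped, so here a backslash is taken to escape only the enclosing quote character.

```java
// Hypothetical helper for illustration; not part of Pig.
final class ParamValueSketch {
    static String unquote(String value) {
        int n = value.length();
        if (n < 2) {
            return value;
        }
        char q = value.charAt(0);
        if ((q != '\'' && q != '"') || value.charAt(n - 1) != q) {
            // Unquoted single-word value: used as-is.
            return value;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < n - 1; i++) {
            char c = value.charAt(i);
            // Assumption: a backslash escapes only the enclosing quote character.
            if (c == '\\' && i + 1 < n - 1 && value.charAt(i + 1) == q) {
                out.append(q);
                i++;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, the value 'Joe\'s URL' from the example above becomes Joe's URL, which is why the filter constant needs its own quotes.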

Second is a command enclosed in backticks. In this case, the command is executed and its stdout is used as the parameter value:

%declare CMD `generate_date`
A = load '/data/mydata/$CMD';
B = filter A by $0>'5';
.....

Values of both types can be expressed in terms of other parameters, as long as the parameters they depend on are defined before this value.

%declare CMD `$mycmd $date`
A = load '/data/mydata/$CMD';
B = filter A by $0>'5';
.....

In this example, the parameters mycmd and date are substituted first, when the declare statement is encountered. Then the resulting command is executed and its stdout is placed into the path before the load statement runs.
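Executing a backtick value after its parameters have been substituted could be sketched in Java as below. Several details are assumptions rather than part of the proposal: the hypothetical class CommandValueSketch runs the command through "sh -c", passes the command's stderr through to the console, trims surrounding whitespace (including the trailing newline) from stdout, and reports failures as unchecked exceptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Pig.
final class CommandValueSketch {
    static String resolve(String value) {
        if (value.length() < 2 || !value.startsWith("`") || !value.endsWith("`")) {
            return value; // not a command: nothing to execute
        }
        String command = value.substring(1, value.length() - 1);
        try {
            // Assumption: the command is interpreted by "sh -c".
            Process p = new ProcessBuilder("sh", "-c", command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            p.getOutputStream().close(); // the command gets no stdin
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = p.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
            }
            int exit = p.waitFor();
            if (exit != 0) {
                // A non-zero exit code aborts processing.
                throw new IllegalStateException(
                        "Command failed with exit code " + exit + ": " + command);
            }
            // Assumption: surrounding whitespace is not part of the value.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running: " + command, e);
        }
    }
}
```

Errors surface as exceptions rather than exit calls, so the caller can still react to them.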

Debugging

If the -debug option is specified, pig will produce the fully substituted pig script in the current working directory, named <original name>.substituted.

A -dryrun option will be added to pig; when it is specified, no execution is performed and only the substituted script is produced. The same option could also be used to produce just the execution plan.

Logging

Pig uses Apache Commons Logging (http://commons.apache.org/logging/) in conjunction with log4j (http://logging.apache.org/log4j/), and we should do the same in the parameter substitution code.

The following code can be used to instantiate a logger:

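import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
// ... other imports ...

class ParameterSubstitutionPreprocessor
{
    // One logger per instance, named after the runtime class.
    private final Log log = LogFactory.getLog(getClass());
    // ... rest of the class ...
}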

Note that this code will work once we integrate this into Pig.

Pig uses INFO as the default log level. Any messages that you want users to see during normal operation should be logged at this level. Anything that is only useful for debugging should be logged at DEBUG level. Warnings should be logged at WARN level.

Error Handling

All errors should be propagated via exceptions. (The code should not use exit calls, so that the caller can react to the error.)

The following exceptions should be used:

want to make sure that we don't have to declare additional exceptions in our APIs.)

Design

A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following:

  1. Create an empty <original name>.substituted file in the current working directory.
  2. Create a parameter hash that maps parameter names to parameter values.
  3. Read parameters from files in the order they are specified on the command line.
  4. Resolve each parameter:
    • search the parameter value for variables that need to be replaced and perform the replacement if needed. Generate an error and abort if a replacement is needed but the corresponding parameter is not found in the parameter hash.
    • if the value is a command enclosed in backticks, execute the command and set the parameter value equal to the stdout of the command. If the command fails (returns a non-0 value), report the error and abort the processing.
  5. Resolve each command line parameter in the order they are specified on the command line:
    • use the same resolution steps as for parameters passed in a file.
  6. For each line in the input script:
    • if it is a comment or an empty line, copy it over.
    • if it is a declare line, resolve the parameter using the same steps as for parameters passed in a file.
    • if it is a default line, the parameter defined is looked up in the parameter hash. If the parameter is not found, processing identical to a declare line is performed; otherwise, the line is skipped.
    • otherwise, perform parameter substitution on the line and write the result to the output file. Generate an error and abort if a parameter is not found in the parameter hash. (Reuse the code from the parameter substitution in the declare statement.)
  7. If -dryrun is not specified, pass the output file to grunt to execute. Otherwise, print the name of the file and exit.
  8. If neither -debug nor -dryrun is specified, remove the output file.
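The per-line handling of the script can be sketched in Java along these lines. This is a simplified illustration with hypothetical names (PreprocessorSketch, preprocess), not the proposed design in full: it assumes the parameter hash already contains resolved file and command line values under lower-cased names, treats only lines starting with -- as comments, does not copy declare/default lines to the output, and leaves out unquoting, backtick execution, and escaping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Pig.
final class PreprocessorSketch {
    private static final Pattern PARAM =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");
    private static final Pattern DIRECTIVE =
            Pattern.compile("^%(declare|default)\\s+([A-Za-z_][A-Za-z0-9_]*)\\s+(.*)$");

    static List<String> preprocess(List<String> script, Map<String, String> initial) {
        Map<String, String> params = new HashMap<>(initial);
        List<String> out = new ArrayList<>();
        for (String line : script) {
            String trimmed = line.trim();
            // Assumption: only "--" single-line comments are recognized.
            if (trimmed.isEmpty() || trimmed.startsWith("--")) {
                out.add(line); // comments and empty lines are copied over
                continue;
            }
            Matcher d = DIRECTIVE.matcher(trimmed);
            if (d.matches()) {
                String name = d.group(2).toLowerCase(Locale.ROOT);
                boolean isDefault = d.group(1).equals("default");
                // default applies only if the parameter is still undefined;
                // declare always overrides.
                if (!isDefault || !params.containsKey(name)) {
                    params.put(name, substitute(d.group(3).trim(), params));
                }
                continue; // assumption: directives are not written to the output
            }
            out.add(substitute(line, params));
        }
        return out;
    }

    private static String substitute(String text, Map<String, String> params) {
        Matcher m = PARAM.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String value = params.get(m.group(1).toLowerCase(Locale.ROOT));
            if (value == null) {
                throw new IllegalArgumentException("Undefined parameter: " + m.group(1));
            }
            out.append(text, last, m.start()).append(value);
            last = m.end();
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

Writing the resulting lines to <original name>.substituted and handing that file to grunt would be done by the caller.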

Future Features

One nice feature to add later is the ability to delimit parameter names. For instance, in the statement below the intent might be to replace only $date and leave _latest in the path.

A = load 'data/$date_latest';
...

This can be specified with perl-style syntax:

A = load 'data/${date}_latest';
...

ParameterSubstitution (last edited 2009-09-20 23:38:38 by localhost)