Differences between revisions 1 and 2
Revision 1 as of 2007-11-07 18:38:50
Size: 3647
Editor: rollwhite-lx
Revision 2 as of 2009-09-20 23:38:19
Size: 3647
Editor: localhost
Comment: converted to 1.6 markup

Eval Functions

To create an eval function, the following abstract class must be extended. The parameter T is the return type of the eval function.

public abstract class EvalFunc<T extends Datum> {
    public abstract void exec(Tuple input, T output) throws IOException;
}

Input to the Functions

The arguments to the function get wrapped in a tuple and are passed as the parameter input above. Thus, the first field of input is the first argument and so on.

For example, suppose I have a data set A =

<a, b, c>
<1, 2, 3>

Suppose I have written an eval function MyFunc, and my Pig Latin is as follows:

B = foreach A generate MyFunc($0,$2);

Then MyFunc will be called first with the tuple <a, c> and then with the tuple <1, 3>.
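
The argument wrapping described above can be sketched in plain Java. This is only an illustration, not Pig code: rows are modeled as List<String>, and project is a hypothetical helper standing in for what happens when MyFunc($0,$2) is evaluated against each row of A.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: plain lists stand in for Pig tuples, and project() is a
// hypothetical helper, not part of the Pig API.
class ArgumentWrappingDemo {

    // Builds the input tuple for one row by picking the referenced fields
    // ($0 and $2 become positions 0 and 2) in argument order.
    static List<String> project(List<String> row, int... positions) {
        List<String> input = new ArrayList<>();
        for (int p : positions) {
            input.add(row.get(p));
        }
        return input;
    }

    public static void main(String[] args) {
        List<List<String>> a = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("1", "2", "3"));
        for (List<String> row : a) {
            // One call to the eval function per row, with this list as input.
            System.out.println(project(row, 0, 2));
        }
    }
}
```

Running the sketch prints [a, c] and then [1, 3], matching the two calls described above.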

Output of the functions

When extending the abstract class, the type parameter T must be bound to a subclass of Datum. (The compiler will allow you to subclass EvalFunc<Datum>, but you will get an error when you use that function.) When T is bound to a particular type of Datum (DataAtom, Tuple, DataBag, or DataMap), the eval function is handed, through the parameter output, a Datum of type T in which to produce its output.

Note that when T is DataBag, the DataBag you are handed as the parameter output is append-only: you can add to it, but its contents always appear empty if you read them back. This is a performance optimization (we use it for pipelining) based on the assumption that you would not want to examine your own output.
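
One way to picture an append-only output bag is as an object that forwards each added item downstream instead of storing it. The sketch below is an assumption made for illustration: AppendOnlyBag is a hypothetical class, items are plain strings, and nothing here reflects how Pig's DataBag is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustration only: AppendOnlyBag is a hypothetical stand-in, not Pig's DataBag.
class AppendOnlyBag {
    private final Consumer<String> downstream;

    AppendOnlyBag(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    // Each added item is forwarded immediately instead of being stored.
    void add(String item) {
        downstream.accept(item);
    }

    // Nothing is retained, so the bag always looks empty to the function.
    int size() {
        return 0;
    }
}

class AppendOnlyBagDemo {
    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        AppendOnlyBag output = new AppendOnlyBag(received::add);
        output.add("hello");
        output.add("world");
        System.out.println(output.size());   // size seen by the function
        System.out.println(received);        // items seen downstream
    }
}
```

The function writing into the bag never sees its own output, while the consumer downstream receives every item as soon as it is added.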


As an example, here is the code for the built-in function TOKENIZE, which expects one argument of type DataAtom and tokenizes its string value into a data bag of tuples, one for each word in the input string.

public class TOKENIZE extends EvalFunc<DataBag> {

    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument arrives as field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Emit one single-field tuple per token.
            output.add(new Tuple(tok.nextToken()));
        }
    }
}
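
To see which words the delimiter set in the TOKENIZE listing produces, the tokenizing loop can be run on its own with only the Java standard library. In this sketch the class TokenizeDemo and the sample sentence are made up for illustration, and a plain list replaces the output bag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustration only: a standalone version of the tokenizing loop, with a
// List<String> in place of Pig's DataBag.
class TokenizeDemo {

    // Splits on the same delimiter set as the TOKENIZE listing:
    // space, double quote, comma, parentheses, and asterisk.
    static List<String> tokens(String str) {
        List<String> out = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("hello, world (again)"));
    }
}
```

This prints [hello, world, again]: runs of delimiters are skipped, and because the third constructor argument is false, the delimiters themselves are not returned as tokens.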

Advanced Features

  • Schemas: Eval functions can declare their output schema by overriding the following method in EvalFunc. See: PigLatinSchemas.

    /**
     * @param input Schema of the input
     * @return Schema of the output
     */
    public Schema outputSchema(Schema input) {
        return input.copy();
    }
  • Algebraic Eval Functions: If the input to your function might be large (i.e., the input tuple may contain a large bag of tuples nested inside it) and you are concerned about performance, consider writing your function so that it can receive its input in small "chunks," one at a time, and then merge the per-chunk outputs to obtain the final output. (In the map/reduce model, the "combiner" feature does this.) To enable this feature, your eval function must implement the interface Algebraic. See AlgebraicEvalFunc for details.

  • Final cleanup action: If your function needs to perform some final action after it has been called for the last time on a particular input set, it can override the finish method of the class EvalFunc.

    /**
     * Placeholder for cleanup to be performed at the end. User defined functions can override.
     */
    public void finish() {}
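
The chunk-and-merge idea behind algebraic eval functions can be sketched with a simple count. This is a minimal illustration in plain Java: the names countChunk and merge are hypothetical and are not the methods of Pig's Algebraic interface, which is described on the AlgebraicEvalFunc page.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: these method names are hypothetical and are not the
// methods of Pig's Algebraic interface.
class ChunkedCountDemo {

    // Per-chunk step: count the items in one small chunk.
    static long countChunk(List<String> chunk) {
        return chunk.size();
    }

    // Merge step: combine the per-chunk counts into the final result.
    static long merge(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        long first = countChunk(Arrays.asList("a", "b", "c"));
        long second = countChunk(Arrays.asList("d", "e"));
        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

The sketch prints 5, the same answer as counting all five items at once. A function qualifies for this treatment only if merging per-chunk results gives the same output as processing the whole input in one pass.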

EvalFunction (last edited 2009-09-20 23:38:19 by localhost)