Eval Functions

To create an eval function, extend the following abstract class. The type parameter T is the type of the value the eval function produces.

public abstract class EvalFunc<T extends Datum> {
    public abstract void exec(Tuple input, T output) throws IOException;

    // outputSchema() and finish() are described under Advanced Features.
}

Input to the Functions

The arguments to the function get wrapped in a tuple and are passed as the parameter input above. Thus, the first field of input is the first argument and so on.

For example, suppose I have a data set A =

<a, b, c>
<1, 2, 3>

Suppose I have written an eval function MyFunc and my Pig Latin is as follows:

B = foreach A generate MyFunc($0,$2);

Then MyFunc will be called first with the tuple <a, c> and then with the tuple <1, 3>.
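
The wrapping of arguments described above can be sketched in plain Java. This is only an illustration, not Pig code: ordinary lists stand in for Pig tuples and for the data set A, and the names ArgumentWrappingSketch, wrapArguments and argumentTuples are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java lists stand in for Pig tuples and data sets.
class ArgumentWrappingSketch {

    // Builds the argument tuple for one input row by picking the listed
    // zero-based positions, as MyFunc($0,$2) picks fields 0 and 2.
    static List<String> wrapArguments(List<String> row, int... positions) {
        List<String> args = new ArrayList<>();
        for (int p : positions) {
            args.add(row.get(p));
        }
        return args;
    }

    // Produces one argument tuple per row of the data set, in row order.
    static List<List<String>> argumentTuples(List<List<String>> dataSet, int... positions) {
        List<List<String>> calls = new ArrayList<>();
        for (List<String> row : dataSet) {
            calls.add(wrapArguments(row, positions));
        }
        return calls;
    }

    public static void main(String[] args) {
        List<List<String>> a = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("1", "2", "3"));
        // Prints [[a, c], [1, 3]]
        System.out.println(argumentTuples(a, 0, 2));
    }
}
```

Each inner list corresponds to one call of the function, so the first field of the tuple is always the first argument written in the Pig Latin statement.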

Output of the Functions

When extending the abstract class, the type parameter T must be bound to a subclass of Datum. (The compiler will allow you to subclass EvalFunc<Datum>, but you will get an error when you use that function.) When T is bound to a particular type of Datum (DataAtom, Tuple, DataBag, or DataMap), the eval function is handed, through the parameter output, a Datum of type T in which to produce its output.

Note that when T is DataBag, the DataBag you are handed as the parameter output is append-only: you can add to it, but if you read it, it always appears empty. This is a performance optimization (we use it for pipelining) based on the assumption that you would not want to examine your own output.
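
To make the append-only behaviour concrete, here is a minimal sketch of a bag that forwards every added element downstream and retains nothing. It is not Pig's DataBag: the class AppendOnlyBagSketch and its use of a Consumer as the downstream stage are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in, not Pig's DataBag: every added element is forwarded
// downstream immediately and nothing is retained in the bag itself.
class AppendOnlyBagSketch<E> implements Iterable<E> {
    private final Consumer<E> downstream;

    AppendOnlyBagSketch(Consumer<E> downstream) {
        this.downstream = downstream;
    }

    // Appending hands the element straight to the next pipeline stage.
    void add(E element) {
        downstream.accept(element);
    }

    // Reads always see an empty bag because nothing is stored.
    int size() {
        return 0;
    }

    @Override
    public Iterator<E> iterator() {
        return Collections.emptyIterator();
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        AppendOnlyBagSketch<String> output = new AppendOnlyBagSketch<>(received::add);
        output.add("hello");
        output.add("world");
        System.out.println(output.size()); // prints 0
        System.out.println(received);      // prints [hello, world]
    }
}
```

Because nothing is buffered, the next stage can start consuming output before the function has finished producing it, which is the kind of pipelining the note refers to.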


As an example, here is the code for the builtin function TOKENIZE. It expects one argument of type data atom and splits that atom's string value into a data bag of tuples, one tuple for each word in the input string.

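public class TOKENIZE extends EvalFunc<DataBag> {

    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument arrives as field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        // Split on space, double quote, comma, parentheses and asterisk.
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Emit one single-field tuple per word.
            output.add(new Tuple(tok.nextToken()));
        }
    }
}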
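
The splitting behaviour of TOKENIZE can be tried outside Pig with the JDK alone. The sketch below reuses the same delimiter string with java.util.StringTokenizer; the class name TokenizeSketch and the use of a List<String> in place of a DataBag of tuples are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative only: collects words into a List instead of a Pig DataBag.
class TokenizeSketch {
    // Same delimiter set as the TOKENIZE listing:
    // space, double quote, comma, parentheses and asterisk.
    static final String DELIMITERS = " \",()*";

    static List<String> tokenize(String str) {
        List<String> words = new ArrayList<>();
        // 'false' means the delimiters themselves are not returned as tokens.
        StringTokenizer tok = new StringTokenizer(str, DELIMITERS, false);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [hello, big, wide, world, again]
        System.out.println(tokenize("hello (big) \"wide\" world, again*"));
    }
}
```

Note that characters outside the delimiter set, such as a period, are not split on, so "a.b c" yields the two words "a.b" and "c".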

Advanced Features

    /**
     * @param input Schema of the input
     * @return Schema of the output
     */
    public Schema outputSchema(Schema input) {
        return input.copy();
    }

    /**
     * Placeholder for cleanup to be performed at the end. User defined functions can override.
     */
    public void finish() {}
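
As a rough sketch of how an overridden finish() fits in, the following self-contained Java models a function whose exec is called once per tuple and whose finish is called once afterwards. Everything here is a stand-in: SketchEvalFunc, UpperCaseFunc and the run driver are hypothetical, List<String> replaces Pig's Tuple and output types, and the assumption that finish() is invoked exactly once after the last exec call is inferred from the comment above rather than confirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the abstract class; not Pig's EvalFunc.
abstract class SketchEvalFunc {
    abstract void exec(List<String> input, List<String> output);

    // No-op by default, mirroring the placeholder finish() in the listing.
    void finish() {}
}

// Example function that upper-cases its first argument and tracks cleanup.
class UpperCaseFunc extends SketchEvalFunc {
    int calls = 0;
    boolean cleanedUp = false;

    @Override
    void exec(List<String> input, List<String> output) {
        calls++;
        output.add(input.get(0).toUpperCase(Locale.ROOT));
    }

    @Override
    void finish() {
        // A real function might close files or release other resources here.
        cleanedUp = true;
    }
}

class FinishLifecycleSketch {
    // Assumed driver: one exec call per tuple, then a single finish call.
    static List<String> run(SketchEvalFunc func, List<List<String>> tuples) {
        List<String> output = new ArrayList<>();
        for (List<String> t : tuples) {
            func.exec(t, output);
        }
        func.finish();
        return output;
    }

    public static void main(String[] args) {
        UpperCaseFunc func = new UpperCaseFunc();
        List<String> out = run(func, Arrays.asList(
                Arrays.asList("a"), Arrays.asList("b")));
        System.out.println(out);            // prints [A, B]
        System.out.println(func.cleanedUp); // prints true
    }
}
```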

EvalFunction (last edited 2009-09-20 23:38:19 by localhost)