Note: For Pig 0.2.0 or later, some content on this page may no longer be applicable.

Pig User Defined Functions

Pig has a number of built-in functions for loading, filtering, and aggregating data (for a complete list, see PigBuiltins). However, if you want to do something specialized, you may need to write your own Pig user-defined function (UDF). This page walks you through the process.

Types of functions

Eval Function

The most important and most commonly used type of function is the EvalFunction. Eval functions consume a tuple, do some computation, and produce some data.

Eval functions are very flexible; for example, they can mimic "map"-style and "reduce"-style functions.
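As a rough illustration of the two styles, the plain-Java sketch below contrasts a "map"-style function (one value in, one value out) with a "reduce"-style function (a whole group of values in, one aggregate out). It deliberately uses no Pig classes, and the class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch; no Pig classes are used. All names are hypothetical.
class EvalStyles {
    // "Map" style: one input value in, one output value out.
    static String toUpper(String field) {
        return field.toUpperCase(Locale.ROOT);
    }

    // "Reduce" style: a whole group of values in, one aggregate out.
    static long sum(List<Long> group) {
        long total = 0;
        for (long value : group) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toUpper("joe smith"));
        System.out.println(sum(Arrays.asList(1L, 2L, 3L)));
    }
}
```

In a real eval function the input would arrive as a Pig tuple (or a bag of tuples for the grouped case) rather than as plain Java values.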

Load Function

Controls reading of tuples from files.

Store Function

Controls storing of tuples to files.
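Conceptually, a load function turns each record in a file into a tuple of fields, and a store function does the reverse. The sketch below shows that round trip in plain Java, assuming tab-delimited text. It does not use Pig's load/store interfaces, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the load/store round trip; not Pig's actual API.
class LineCodecSketch {
    // "Load" direction: split one line of text into its fields.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static List<String> parseLine(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    // "Store" direction: join the fields back into one line of text.
    static String formatLine(List<String> fields) {
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        List<String> fields = parseLine("widget\t19.99\thttp://www.example.com/w");
        System.out.println(fields.size());
        System.out.println(formatLine(fields));
    }
}
```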

Example

The following example uses each of the types of functions. It computes the set of unique IP addresses associated with "good" products drawn from a list of products found on the web.

register myFunctions.jar
products = LOAD '/productlist.txt' USING MyListStorage() AS (name, price, description, url);
goodProducts = FILTER products BY (price <= '19.99');
hostnames = FOREACH goodProducts GENERATE MyHostExtractor(url) AS hostname;
uniqueIPs = FOREACH (GROUP hostnames BY MyIPLookup(hostname)) GENERATE group AS ipAddress;
STORE uniqueIPs INTO '/iplist.txt' USING MyListStorage();

In the above example, MyListStorage() serves as a load function as well as a store function; MyHostExtractor() and MyIPLookup() are eval functions. myFunctions.jar is a jar file that contains the classes for the user-defined functions.
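The page does not show how MyHostExtractor() is implemented. As a hedged guess at its core logic, the plain-Java sketch below extracts the host name from a URL string using java.net.URI. The class and method names are hypothetical, and the Pig wrapper (reading the url field from the input tuple and writing the result) is omitted.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical core logic for a host-extracting eval function; not Pig's API.
class HostExtractorSketch {
    // Returns the host part of the URL, or null if the string cannot be
    // parsed as a URI or has no host component.
    static String extractHost(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.example.com/products/42"));
    }
}
```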

How to write functions

Ready to write your own handy-dandy Pig function? Before you start, you will need to know about the APIs for interacting with the data types (atom, tuple, bag); see PigDataTypeApis.

OK, I have written my function. How do I use it?

To use your function, follow the steps below. The example uses an eval function; follow the same procedure for a load or store function.

1. Create your function in /src/myfunc/MyEvalFunc.java

package myfunc;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

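// Eval function that splits its first input field into tokens and
// emits each token as a single-field tuple in the output bag.
public class MyEvalFunc extends EvalFunc<DataBag>
{
        //@Override
        public void exec(Tuple input, DataBag output) throws IOException
        {
                // Read the first field of the input tuple as a string.
                String str = input.getAtomField(0).strval();
                // Split on spaces, double quotes, commas, parentheses, and asterisks;
                // the delimiters themselves are not returned as tokens.
                StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
                while (tok.hasMoreTokens())
                {
                        // Wrap each token in its own tuple and add it to the output bag.
                        output.add(new Tuple(tok.nextToken()));
                }
        }
}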

2. Compile your function. Make sure to point the Java compiler to the Pig jar file.

/src/myfunc $ javac -classpath /src/pig.jar MyEvalFunc.java

3. Create the jar file.

/src/myfunc $ cd ..
/src $ jar cf myfunc.jar myfunc

4. Use the function through grunt (use from a script is similar). Note that there are no quotes around the path in the register call.

/src $ java -jar pig.jar -
grunt> register /src/myfunc.jar
grunt> A = load 'students' using PigStorage('\t');
grunt> B = foreach A generate myfunc.MyEvalFunc($0);
grunt> dump B;
({(joe smith)})
({(john adams)})
({(anne white)})
....
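The tokenizing logic inside MyEvalFunc can also be tried on its own, without a Pig classpath. The sketch below reuses the same StringTokenizer call and delimiter set as step 1 but collects the tokens in a plain Java list instead of a DataBag. The class name, method name, and sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Standalone version of the tokenizing step; no Pig classes are needed.
class TokenizeSketch {
    // Splits the input on the same delimiters as MyEvalFunc:
    // space, double quote, comma, parentheses, and asterisk.
    static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            tokens.add(tok.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the tokens found in a sample string.
        System.out.println(tokenize("hello, world (again)"));
    }
}
```

Checking the splitting behavior this way can be quicker than rebuilding the jar and rerunning grunt after every change.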

See EmbeddedPig for an example of embedding Pig and your functions in Java. Use the same procedure outlined above to create your function jar file.

Advanced Features:

WriteFunctions (last edited 2009-09-20 23:38:30 by localhost)