EmbeddedPig

Embedding Pig In Java Programs

Sometimes you want more control than Pig scripts can give you. If so, you can embed Pig Latin in Java (just like SQL can be embedded in programs using JDBC).

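As the example below shows, the following steps need to be carried out: create a PigServer instance, register any jars that your queries depend on, pass each Pig Latin statement to registerQuery, and write out the result with store.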

Example

Let's assume that you need to count the number of occurrences of each word in a document. Let's also assume that you have an eval function, tokenize, that parses a line of text and returns all the words in that line. The function is located in /mylocation/tokenize.jar.

The Pig Latin script for this computation looks as follows:

register /mylocation/tokenize.jar
A = load 'mytext' using TextLoader();
B = foreach A generate flatten(tokenize($0));
C = group B by $0;
D = foreach C generate flatten(group), COUNT(B.$0);
store D into 'myoutput';
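
To make the data flow of this script concrete, here is a minimal plain-Java sketch of the same computation that does not use Pig at all. The class name WordCountSketch and the whitespace-based tokenize method are assumptions made for illustration only; the real tokenize function in tokenize.jar may split words differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the script's data flow; no Pig classes are involved.
class WordCountSketch {

    // Stand-in for the tokenize function (assumption: words are separated by whitespace).
    static List<String> tokenize(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Mirrors relations B, C and D: flatten the words, group by word, count each group.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> tokenize(line).stream())   // B: one word per record
                .collect(Collectors.groupingBy(             // C: group by the word
                        word -> word,
                        TreeMap::new,                       // sorted output, for readability
                        Collectors.counting()));            // D: count per group
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog", "the end");
        // Print one "word<TAB>count" pair per line.
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The sketch holds everything in memory, which is only practical for small inputs; the Pig version expresses the same flatten, group and count steps but leaves the execution to Pig.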

The same Pig Latin script can be embedded in the following Java program:


import java.io.IOException;
import org.apache.pig.PigServer;

public class WordCount {
   public static void main(String[] args) {
      
      PigServer pigServer = new PigServer();
        
      try {
         // Make the tokenize function available to the queries registered below
         pigServer.registerJar("/mylocation/tokenize.jar");
         runMyQuery(pigServer, "myinput.txt");
      } catch (IOException e) {
         e.printStackTrace();
      }
   }
   
   public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {        
       // Each call registers one Pig Latin statement
       pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
       pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
       // B has a single field (the word), so group on $0
       pigServer.registerQuery("C = group B by $0;");
       pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");

       // Write relation D to 'myoutput'
       pigServer.store("D", "myoutput");
   }
}

Notes:

To run your program, you need to first compile it by using the following command:

javac -cp <path>pig.jar WordCount.java

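If the compilation is successful, you can then run your program. The classpath must also include the directory that contains WordCount.class (here the current directory, "."); on Windows, use ";" instead of ":" as the separator:

java -cp <path>pig.jar:. WordCount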

last edited 2007-11-07 23:52:30 by OlgaN