The purpose of this page is to give a quick tour of Pig. It is intentionally high-level and omits some details, so that it is easy to digest for somebody who just wants to understand the main capabilities of Pig.

Complete documentation is available at: Pig Wiki Main Page

What is Pig:

Pig has two parts:

1. A language for expressing data processing flows, called Pig Latin.
2. An execution environment that runs Pig Latin programs, e.g., by compiling them into Map-Reduce jobs executed on Hadoop.

Pig Latin programs:

Before you start

Make sure you [BuildPig] (or download it) and then [RunPig], which will help you check your configuration and run a small task.

Data formats and models:

Other capabilities:

Examples:

Example 1: Hello Pig

Suppose you have a function, makeThumbnail(), that converts an image into a small thumbnail representation. You want to convert a set of images into thumbnails. A Pig Latin program to do this is:

images = load '/myimages' using myImageStorageFunc();
thumbnails = foreach images generate makeThumbnail(*);
store thumbnails into '/mythumbnails' using myImageStorageFunc();

The first line tells Pig: (1) what the input to your computation is (in this case, the content of the directory '/myimages'), and (2) how Pig can interpret the file and delineate individual images (in this case, by invoking myImageStorageFunc()). (#2 is optional; if omitted, Pig will attempt to parse the file using its default parser.)
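
As an aside, a load with the storage function omitted might look like this (a minimal sketch, assuming a hypothetical tab-delimited text file '/mydata'):

records = load '/mydata';    -- no "using" clause: Pig falls back to its default
                             -- parser, which splits tab-delimited text into fields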

The second line instructs Pig to convert every image into a thumbnail, by running the user's makeThumbnail function on each image.

The third line instructs Pig to store the result into the directory '/mythumbnails', encoding the data using the myImageStorageFunc() function.
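
Note that Pig must also be told where user functions such as makeThumbnail() are implemented. A minimal sketch, assuming the functions are compiled into a hypothetical jar myudfs.jar under a hypothetical Java package com.example:

register myudfs.jar;                               -- make the user's classes visible to Pig
define makeThumbnail com.example.MakeThumbnail();  -- bind a short alias to the UDF class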

Most Pig Latin commands consist of an assignment to a variable (e.g., images, thumbnails). These variables denote tables, but these tables are not necessarily materialized on disk or in the memory of any one machine. The final "store" command causes Pig to compile the preceding commands into an execution plan, e.g., one or more Map-Reduce jobs to execute on Hadoop. In the above example, the program will be compiled into a single Map-Reduce job in which the Reduce phase is disabled, i.e., the output of the Map is the final output.
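
If you are curious what plan Pig produces for a program, the "explain" command prints it for a given variable, e.g.:

explain thumbnails;    -- prints the plan (e.g., the Map-Reduce jobs) for this table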

Example 2: Using the relational-style operations

Suppose you have a log of users visiting web pages, which has entries of the form (user, url, time). Say you want to compute the average number of page visits done by a user (e.g., if the answer is 4, that means that, on average, each user generated four page visit events in the log). Here's a Pig Latin program that computes this number:

     VISITS = load '/visits' as (user, url, time);
USER_VISITS = group VISITS by user;
USER_COUNTS = foreach USER_VISITS generate group as user, COUNT(VISITS) as numvisits;
 ALL_COUNTS = group USER_COUNTS all;
  AVG_COUNT = foreach ALL_COUNTS generate AVG(USER_COUNTS.numvisits);

dump AVG_COUNT;

The first line loads the data and specifies the schema. The "using" clause has been omitted because here we assume the data is in a tab-delimited text format that Pig can parse by default. Since this data has multiple fields, we use the "as" clause to assign names to them: user, url, time. The "as" clause is optional; if it is not used, you can refer to the fields by position (e.g., $0 for user, $1 for url, $2 for time).
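
As an aside, here is how the grouping step would look without a schema, referring to the user field by position (a sketch):

VISITS = load '/visits';           -- no "as" clause: fields are unnamed
USER_VISITS = group VISITS by $0;  -- $0 is the first field, i.e., the user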

The second line forms groups of tuples, one group for each unique user. The third line computes the size of each group, i.e., the number of log events associated with each user.

The fourth line places all tuples output from the previous step into a single group, and the fifth line computes the average of these values, i.e., the average of the per-user counts. If you wanted, say, standard deviation instead of average, which at present is not a built-in Pig function, you could write your own function and reference it in the "generate" clause in place of AVG.
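
A minimal sketch of that substitution, assuming a hypothetical user-defined StdDev function packaged in a hypothetical jar mystats.jar:

register mystats.jar;               -- jar containing the user's function
define STDEV com.example.StdDev();  -- hypothetical aggregation function
STD_COUNT = foreach ALL_COUNTS generate STDEV(USER_COUNTS.numvisits);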

Since the output is small (a single number), the user decided to use "dump" instead of "store" to produce the output. Dump causes the output to be printed to the screen, instead of written to a file. In general, we recommend that you use this command with caution :)
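
Had the output been large, the last line would instead write it to a file, e.g.:

store AVG_COUNT into '/avgcount';    -- writes the result to a file instead of the screen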

Example 3: Combining multiple data sets

One of the key features of Pig is that you can combine multiple data sets using operations such as join, union, and cogroup. (These operations are explained in detail in our documentation.)
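
For example, cogroup collects tuples from two data sets that share a key into one group, without flattening them the way join does (a minimal sketch using the data sets introduced below):

 VISITS = load '/visits' as (user, url, time);
  PAGES = load '/pages' as (url, pagerank);
GROUPED = cogroup VISITS by url, PAGES by url;  -- one tuple per url, holding a bag of
                                                -- matching visits and a bag of pages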

Suppose we have our web page visit log from Example 2, and we also have a file that records the pagerank of each URL in the known universe, including all URLs in the visit log. If you aren't familiar with pagerank, just think of it as a numeric quality score for each web page. (In this example we assume the pagerank values have been computed in advance, although the iterative pagerank computation can itself be expressed in Pig Latin, with an outer Java function controlling the looping.) Say you want to identify users who tend to visit "good" pages, defined as users whose visited pages have an average pagerank exceeding some threshold. Here it is in Pig Latin:

      VISITS = load '/visits' as (user, url, time);
       PAGES = load '/pages' as (url, pagerank);
VISITS_PAGES = join VISITS by url, PAGES by url;
 USER_VISITS = group VISITS_PAGES by user;
  USER_AVGPR = foreach USER_VISITS generate group, AVG(VISITS_PAGES.pagerank) as avgpr;
  GOOD_USERS = filter USER_AVGPR by avgpr > 0.5;

store GOOD_USERS into '/goodusers';

The first two lines load our two data sets (visits and pages). The third line specifies a join over the two data sets: it finds VISITS tuples that have the same url as PAGES tuples, and glues them together. Hence the VISITS_PAGES table gives us the pagerank of the URL in each visit tuple.

The fourth line groups tuples by user, and the fifth line computes the average pagerank of each user's visited URLs.

The sixth line filters out users whose average pagerank is not greater than 0.5.

Is Pig the right platform for your scenario?

Pig is right for you if you:

Pig is not right for you if you: