Differences between revisions 11 and 12
Revision 11 as of 2009-01-15 01:41:43
Size: 9416
Editor: nat-dip4
Revision 12 as of 2009-09-20 23:38:25
Size: 9420
Editor: localhost
Comment: converted to 1.6 markup
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
Complete documentation is available at: [http://wiki.apache.org/pig/FrontPage Pig Wiki Main Page] Complete documentation is available at: [[http://wiki.apache.org/pig/FrontPage|Pig Wiki Main Page]]
Line 9: Line 9:
 * A set of ''evaluation mechanisms'' for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) evaluation by translation into one or more Map-Reduce jobs, executed using [http://lucene.apache.org/hadoop Hadoop].  * A set of ''evaluation mechanisms'' for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) evaluation by translation into one or more Map-Reduce jobs, executed using [[http://lucene.apache.org/hadoop|Hadoop]].

The purpose of this page is to give a quick tour of Pig. It is intentionally high level and misses some details, so as to make it easy to consume by somebody who just wants to understand the main capabilities of Pig.

Complete documentation is available at: Pig Wiki Main Page

What is Pig:

Pig has two parts:

  • A language for processing data, called Pig Latin.

  • A set of evaluation mechanisms for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) evaluation by translation into one or more Map-Reduce jobs, executed using Hadoop.

Pig Latin programs:

  • Pig Latin has built-in relational-style operations such as filter, project, group, join. Pig Latin also has a map operation that applies a custom user function to every member of a set. In Pig Latin, the map operation is called "foreach".
  • Additionally, users can incorporate their own custom code into essentially any Pig Latin operation. For example, if a user has a function that determines whether a given image contains a human face, the user can ask Pig to filter images according to this function. Pig will then evaluate this function on the user's behalf, over the images. If the evaluation mechanism incorporates parallelism, as is the case with the Hadoop evaluation mechanism, then the user's function will be executed in a parallel fashion.

Before you start

Make sure you [BuildPig] (or download it) and then [RunPig]. [RunPig] will help you check your configuration and run a small task.

Data formats and models:

  • Pig can process data of any format. (Pigs eat anything! .. or is that goats?) A few common formats such as tab delimited text files, are supported via built-in capabilities. A user can add support for a file format by writing a function that parses the bytes of a file into objects in Pig's data model, and vice versa.
  • Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a "map" data type, which is useful in representing semistructured data, e.g. JSON or XML.

Other capabilities:

  • Combining multiple data sets: Pig can combine multiple data sets, via operations such as join, union or cogroup.

  • Splitting data sets: Pig can split a single data set into multiple ones, using an operation called split.

  • Semistructured data: Pig supports "maps" of (key, value) pairs, where retrieving the value associated with a given key is an efficient operation. Maps provide a convenient way to represent semistructured data, where the set of non-null fields varies from record to record. Maps are helpful when processing JSON, XML, and sparse relational data (i.e., tables with a lot of null values).


Example 1: Hello Pig

Suppose you have a function, makeThumbnail(), that converts an image into a small thumbnail representation. You want to convert a set of images into thumbnails. A Pig Latin program to do this is:

images = load '/myimages' using myImageStorageFunc();
thumbnails = foreach images generate makeThumbnail(*);
store thumbnails into '/mythumbnails' using myImageStorageFunc();

The first line tells Pig: (1) what is the input to your computation (in this case, the content of the directory '/myimages'), and (2) how can Pig interpret the file and delineate individual images (in this case, by invoking myImageStorageFunc). (#2 is optional; if omitted, Pig will attempt to parse the file using its default parser.)

The second line instructs Pig to convert every image into a thumbnail, by running the user's makeThumbnail function on each image.

The third line instructs Pig to store the result into the directory '/mythumbnails', and encode the data into the file according to the myImageStorageFunc() function.

Most Pig Latin commands consist of an assignment to a variable (e.g., images, thumbnails). These variables denote tables, but these tables are not necessarily materialized on disk or in the memory of any one machine. The final "store" command causes Pig to compile the preceeding commands into an execution plan, e.g., one or more Map-Reduce jobs to execute on Hadoop. In the above example, the program will get compiled into a single Map-Reduce job where the Reduce phase is disabled, i.e., the output of the Map is the final output.

Example 2: Using the relational-style operations

Suppose you have a log of users visiting web pages, which has entries of the form (user, url, time). Say you want to compute the average number of page visits done by a user (e.g., if the answer is 4, that means that, on average, each user generated four page visit events in the log). Here's a Pig Latin program that computes this number:

     VISITS = load '/visits' as (user, url, time);
USER_VISITS = group VISITS by user;
USER_COUNTS = foreach USER_VISITS generate group as user, COUNT(VISITS) as numvisits;
  AVG_COUNT = foreach ALL_COUNTS generate AVG(USER_COUNTS.numvisits);


The first line loads the data and specifies the schema. The "using" clause has been omitted because here we assuming the data is in a tab-delimited text format that Pig can parse by default. Since this data is multifaceted, we use the "as" clause to assign names to the data fields: user, url, time. The "as" clause is optional; if it is not used you can refer to the fields by position (e.g., $0 for user, $1 for url, $2 for time).

The second line forms groups of tuples, one group for each unique user. The third line computes the size of each group, i.e., the number of log events associated with each user.

The fourth line places all tuples output from the previous step into a single group, and the fifth line computes the average of these values, i.e., the average of the per-user counts. If you wanted, say, standard deviation instead of average, which at present is not a built-in Pig function, you could write your own function and referenced it in the "generate" clause in place of AVG.

Since the output is small (a single number), the user decided to use "dump" instead of "store" to produce the output. Dump causes the output to be printed to the screen, instead of written to a file. In general, we recommend that you use this command with caution :)

Example 3: Combining multiple data sets

One of the key features of Pig is that you can combine multiple data sets using operations like join, union, cogroup. (These operations are explained in detail in our documentation.)

Suppose we have our web page visit log from Example 2, and we also have a file that records the pagerank of each URL in the known universe, including all URLs in the visit log. If you aren't familiar with pagerank, just think of it as a numeric quality score of each web page. (In this example we assume the pagerank values have been computed in advance; although the iterative pagerank computation can be expressed in Pig Latin with an outer Java function controlling the looping.) Say you want to identify users who tend visit "good" pages, defined in terms of average pagerank exceeding some threshold. Here it is in Pig Latin:

      VISITS = load '/visits' as (user, url, time);
       PAGES = load '/pages' as (url, pagerank);
VISITS_PAGES = join VISITS by url, PAGES by url;
 USER_VISITS = group VISITS_PAGES by user;
  USER_AVGPR = foreach USER_VISITS generate group, AVG(VISITS_PAGES.pagerank) as avgpr;
  GOOD_USERS = filter USER_AVGPR by avgpr > '0.5';

store GOOD_USERS into '/goodusers';

The first two lines load our two data sets (visits and pages). The third line specifies a join over the two sets -- it finds visits tuples that have the same URL as pages tuples, and glues them together. Hence the VISITS_PAGES table gives us the pagerank of the URL in each visit tuple.

The fourth line groups tuples by user, and the fifth line computes the average pagerank of each user's visited URLs.

The fifth line filters out users whose pagerank is not greater than 0.5.

Is Pig the right platform for your scenario?

Pig is right for you if you:

  • Need to analyze data that is small (kilobytes), tall (megabytes), grande (gigabytes), or venti (terabytes).

  • Want to be able to create, modify and reuse your analysis logic easily.
  • Process one data set at a time
  • ... or need to combine multiple data sets.
  • Do simple processing (e.g., count the number of images on the web)
  • ... or complex processing (e.g., count the number of images that contain faces).
  • It's especially right for you if you have access to a many-computer cluster already running hadoop.

Pig is not right for you if you:

  • Need to retrieve individual records, or small ranges of records, from a very large data set (e.g., lookup Joe Smith's customer profile).
  • Have real-time data serving requirements (e.g., assemble a web page for Joe in under 100ms).
  • Need to be able to do random writes to specific data records.
  • Don't like barnyard animals.

PigOverview (last edited 2009-09-20 23:38:25 by localhost)