
Illustrate

Illustrate is a new addition to Pig that helps users debug their Pig scripts.

The idea is to select a few example data items and illustrate how they are transformed by the sequence of Pig commands in the user's program. The ExampleGenerator algorithm selects an appropriate and concise set of example data items automatically. It does a better job than random sampling: with random sampling, selective operations such as filters or joins can eliminate all the sampled data items, leaving empty results that are of no help in debugging.
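The drawback of random sampling can be seen in a small simulation. The following Python sketch is not the ExampleGenerator algorithm; it only contrasts a random sample with a naive selection that deliberately keeps rows on both sides of a filter. The data set, the helper names (make_visits, is_recent, targeted_examples) and the sizes are all assumptions made for illustration.

```python
import random

def make_visits(n=1000, n_recent=3):
    """Hypothetical data: n visit records, only n_recent of which are recent."""
    old = [("user%d" % i, "site%d.com" % i, "20070101") for i in range(n - n_recent)]
    new = [("user%d" % i, "site%d.com" % i, "20071205") for i in range(n - n_recent, n)]
    return old + new

def is_recent(row):
    # Same predicate as a filter on timestamp >= '20071201'.
    return row[2] >= "20071201"

def targeted_examples(rows, predicate, k):
    """Naive stand-in for example selection: keep rows that pass the
    filter and, if there is room, one row that fails it."""
    passing = [r for r in rows if predicate(r)]
    failing = [r for r in rows if not predicate(r)]
    chosen = passing[: max(1, k - 1)] + failing[:1]
    return chosen[:k]

rng = random.Random(0)
visits = make_visits()

# Count how often a random sample of 3 rows leaves the filter output empty.
trials = 200
empty_runs = sum(
    1 for _ in range(trials)
    if not any(is_recent(r) for r in rng.sample(visits, 3))
)
print("random samples with empty filter output:", empty_runs, "of", trials)

# The targeted selection always shows the filter both keeping and dropping a row.
targeted = targeted_examples(visits, is_recent, 3)
print("targeted examples:", targeted)
```

With only 3 recent rows among 1000, almost every random sample of 3 rows produces an empty filter result, while the targeted selection always exercises both branches of the filter.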

The "ILLUSTRATE" functionality saves users from having to test their Pig programs on large data sets, which has a long turnaround time and wastes system resources. The algorithm uses the "Local" execution operators (it does not run on Hadoop), so it can generate illustrative example data in near real time.

Usage

The illustrate command can be used as follows.

Say the input file is 'visits.txt' and contains the following data:

Amy     cnn.com 20070218
Fred    harvard.edu     20071204
Amy     bbc.com 20071205
Fred    stanford.edu    20071206

A Grunt session might look something like this. Note the schema in the load statement: ExampleGenerator needs you to provide aliases for the fields.

grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> recent_visits = filter visits by timestamp >= '20071201';
grunt> user_visits = group recent_visits by user;
grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
grunt> illustrate num_user_visits

This triggers the ExampleGenerator, which displays examples something like this:

-------------------------------------------------
| visits     | user  | url          | timestamp | 
-------------------------------------------------
|            | Fred  | harvard.edu  | 20071204  | 
|            | Fred  | stanford.edu | 20071206  | 
|            | Amy   | cnn.com      | 20070218  | 
-------------------------------------------------
--------------------------------------------------------
| recent_visits     | user  | url          | timestamp | 
--------------------------------------------------------
|                   | Fred  | harvard.edu  | 20071204  | 
|                   | Fred  | stanford.edu | 20071206  | 
--------------------------------------------------------
---------------------------------------------------------------------------------------------
| user_visits     | group | recent_visits: (user, url, timestamp )                          | 
---------------------------------------------------------------------------------------------
|                 | Fred  | {(Fred, harvard.edu, 20071204), (Fred, stanford.edu, 20071206)} | 
---------------------------------------------------------------------------------------------
----------------------------------------
| num_user_visits     | group | count1 | 
----------------------------------------
|                     | Fred  | 2      | 
----------------------------------------
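To check the tables above by hand, the same four steps can be replayed outside Pig. The Python sketch below is only an illustration of what each statement does to the three example rows shown in the visits table; it is not how Pig executes the script, and the tuple layout (user, url, timestamp) is an assumption taken from the load schema.

```python
from collections import defaultdict

# The three example rows from the 'visits' table: (user, url, timestamp).
visits = [
    ("Fred", "harvard.edu", "20071204"),
    ("Fred", "stanford.edu", "20071206"),
    ("Amy", "cnn.com", "20070218"),
]

# recent_visits = filter visits by timestamp >= '20071201';
recent_visits = [v for v in visits if v[2] >= "20071201"]

# user_visits = group recent_visits by user;
user_visits = defaultdict(list)
for v in recent_visits:
    user_visits[v[0]].append(v)

# num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
num_user_visits = [(user, len(rows)) for user, rows in user_visits.items()]

print(recent_visits)
print(dict(user_visits))
print(num_user_visits)  # [('Fred', 2)]
```

Amy's only example row is dated 20070218, so the filter drops it and only Fred reaches the group and count steps, which matches the last two tables.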

Illustrate for Pig 2.0 (PigTypes)

Illustrate is now also part of Pig 2.0. The following are not currently supported and are on the roadmap:

  • LIMIT
  • SPLIT (both implicit and explicit)
  • Nested FOREACH
  • Map data type

ExampleGenerator (last edited 2009-09-20 23:38:19 by localhost)