In this exercise we will work through the example shown in the presentation. We have two datasets: users and pages. users contains the userid (loaded below as the name field) and age of every user of some service. pages contains the userid and the url visited by that user. We will work through the exercise using the interactive shell:

java -jar pig.jar -

(Note: the wiki may try to be smart and turn quotes into backticks or curly quotes to make things look nice. When you type the statements, all quotes should be plain single quotes.)

We start off by loading the users and pages datasets.

Users = load '/data/users' as (name, age);
Pages = load 'data/pages' as (user, url);

What is the format of this data? (Use describe Users; or dump Users; to answer the question.)
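
As a rough mental model of what load does here: when no load function is named, Pig's default loader (PigStorage) splits each input line on tabs, and each line becomes one tuple whose fields take the names given in the as clause. The Python sketch below imitates that on made-up lines. The sample data and the helper name load_tsv are assumptions for illustration only; they are not the exercise data or anything Pig provides.

```python
# Hypothetical sample lines standing in for the contents of the users file.
SAMPLE_USERS = "alice\t22\nbob\t31\ncarol\t19\n"

def load_tsv(text, field_names):
    """Split tab-delimited text into one dict per line, keyed by field name."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        values = line.split("\t")
        rows.append(dict(zip(field_names, values)))
    return rows

users = load_tsv(SAMPLE_USERS, ["name", "age"])
for row in users:
    print(row)
```

The values stay as untyped strings in this sketch, much as Pig leaves fields untyped when the as clause declares no types.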

Now we filter the users, keeping only those aged 18 to 25:

Fltrd = filter Users by 
        age >= 18 and age <= 25;
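
To make the filter condition concrete, here is a small Python sketch of the same test applied to a handful of rows. The sample rows are invented for illustration, and the ages are plain integers here for simplicity.

```python
# Hypothetical sample rows of (name, age); not the real exercise data.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]

# Keep only the tuples whose age lies in the closed range 18..25,
# mirroring: filter Users by age >= 18 and age <= 25;
fltrd = [(name, age) for (name, age) in users if age >= 18 and age <= 25]

print(fltrd)
```

Both bounds are inclusive, so users aged exactly 18 or exactly 25 are kept.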

Now let's do the join.

Jnd = join Fltrd by name, Pages by user;

What does this data look like? You can use describe to verify your answer.
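
One way to picture the joined relation is the Python sketch below. It models an inner join on name = user, where each output tuple carries the fields of both inputs side by side. The sample rows are assumptions made up for this sketch, and the dictionary index is just one convenient way to do the matching in Python, not a description of how Pig executes the join.

```python
# Hypothetical sample inputs; not the real exercise data.
fltrd = [("alice", 22), ("carol", 19)]
pages = [
    ("alice", "a.com"),
    ("bob", "a.com"),
    ("carol", "b.com"),
    ("alice", "b.com"),
]

# Index page visits by user so each filtered user can find its matches.
pages_by_user = {}
for user, url in pages:
    pages_by_user.setdefault(user, []).append((user, url))

# Inner join: one output tuple per matching (user row, page row) pair.
# Users with no page visits, and visits by users not in fltrd, are dropped.
jnd = [
    (name, age, user, url)
    for (name, age) in fltrd
    for (user, url) in pages_by_user.get(name, [])
]

for row in jnd:
    print(row)
```

Note that the join key appears twice in each output tuple, once from each side.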

Next we group the joined data by url:

Grpd = group Jnd by url;

How does group differ from join? Again use describe.
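
The difference can be sketched in Python as follows. Where the join produced flat tuples, group produces one tuple per distinct key, holding the key (named group in Pig) and a bag of all the original tuples that share it. The sample rows are assumptions for illustration, and a Python list stands in for a Pig bag.

```python
# Hypothetical joined rows of (name, age, user, url); not real data.
jnd = [
    ("alice", 22, "alice", "a.com"),
    ("alice", 22, "alice", "b.com"),
    ("carol", 19, "carol", "b.com"),
]

# Collect the joined rows into one "bag" (a list here) per url.
bags = {}
for row in jnd:
    url = row[3]
    bags.setdefault(url, []).append(row)

# One (group, bag) pair per distinct url; the rows inside stay intact.
grpd = list(bags.items())

for key, bag in grpd:
    print(key, bag)
```

The output is nested: the number of tuples equals the number of distinct urls, not the number of joined rows.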

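Finally we count the clicks for each url, sort by that count, keep the top 100, and store the result:

Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top100 = limit Srtd 100;
store Top100 into 'top100sites';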

Finish it up. Does top100sites contain what you expect?
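
As a way to check your expectation, the last steps (count, order, limit) can be modelled in Python as below. The grouped sample data is invented for this sketch, and lists stand in for Pig bags.

```python
# Hypothetical grouped data of (url, bag of joined rows); not real data.
grpd = [
    ("a.com", [("alice", 22, "alice", "a.com")]),
    ("b.com", [("alice", 22, "alice", "b.com"),
               ("carol", 19, "carol", "b.com")]),
]

# foreach ... generate group, COUNT(Jnd): one (url, clicks) pair per group.
smmd = [(url, len(bag)) for (url, bag) in grpd]

# order ... by clicks desc: highest click count first.
srtd = sorted(smmd, key=lambda pair: pair[1], reverse=True)

# limit ... 100: keep at most the first 100 rows.
top100 = srtd[:100]

print(top100)
```

With fewer than 100 distinct urls, as in this sample, the limit step simply keeps every row.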

PigExercise1 (last edited 2009-09-20 23:38:37 by localhost)