
For this second exercise we are going to be a bit more adventurous. We are going to generate some example data for Exercise 1 using a shell script and a UDF. We will start off with a list of names in a file:

sn = load 'singlenames';
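
The 'singlenames' file is simply one name per line (the script below reads it that way). For example, it could look like this; the names themselves don't matter and are shown purely for illustration:

alice
bob
carol
dave
erin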

Now we are going to write a shell script that permutes the names into a list of userids with ages. We will invoke it like this (note that this time the quotes need to be backquotes):

users = stream sn through `randid.sh` as (user, age);
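
If you are running on a real cluster rather than locally, the script also has to be shipped to the nodes that run the job. One way to do that, sketched here assuming randid.sh sits in your current working directory, is Pig's define ... ship() clause:

define randid `randid.sh` ship('randid.sh');
users = stream sn through randid as (user, age);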

randid.sh will get the contents of 'singlenames' on standard input. Anything it writes to standard output is taken as output tuples. By default, tuples are separated by \n and fields by \t. If you'd rather skip the pain of writing randid.sh yourself, here is an example:

#!/bin/bash

# Emit a random name from the list, possibly truncated to its first two or three characters.
function partName() {
        name=${list[$((RANDOM%count))]}
        seg=$((RANDOM%3))
        if [ $seg -eq 1 ]
        then
                name=${name:0:2}
        fi
        if [ $seg -eq 2 ]
        then
                name=${name:0:3}
        fi
        echo -n $name
}

count=0
while read name
do
        list[$count]="$name"
        count=$((count+1))
done

iterations=$((count*count/4))
while [ $iterations -gt 0 ]
do
        partName
        partName
        age=$(((RANDOM%50)+18))
        echo -e "\t$age"
        iterations=$((iterations-1))
done
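
You can sanity-check the script outside Pig before wiring it into the stream statement. For example, using the file names from above:

chmod +x randid.sh
./randid.sh < singlenames | head

Each line of output should be a made-up userid, a tab, and an age.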

Okay, now that we have our users, let's generate the pages dataset. We want to generate a bunch of page requests for each user, so we will write a UDF that takes in tuples from users and generates fake traffic:

pages = foreach users generate flatten(pig.example.GenerateClicks(*)) as (user, url);

GenerateClicks needs to extend EvalFunc<DataBag>. Here is an example implementation:

package pig.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataAtom;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Datum;
import org.apache.pig.data.Tuple;

public class GenerateClicks extends EvalFunc<DataBag> {
    Random rand = new Random(System.currentTimeMillis());
    String prefixes[] = {
            "finance",
            "www",
            "search",
            "mail",
            "photo",
            "personal",
            "news",
            "m",
            "video",
            "music",
            "answers",
            "i",
            "im",
            "svcs",
            "web",
            "shop",
            "help",
            "buy",
            "rec",
            "money"
    };
    String sites[] = {
            "cnn",
            "msn",
            "yahoo",
            "google",
            "aol",
            "live",
            "cnet",
            "ask",
            "boop",
            "slashdot",
            "nbc",
            "cbs",
            "baidu",
    };
    String suffixes[] = {
            "com",
            "net",
            "org",
            "us",
            "ca",
            "ch",
            "sg",
            "il",
            "ja",
            "uk",
    };
    
    // Duplicate a few randomly chosen entries so that some values occur more often than others.
    void bias(ArrayList<String> l) {
        for(int i = 0; i < 4; i++) {
            int r = rand.nextInt(l.size());
            String e = l.get(r);
            for(int j = 0; j < i*4; j++) {
                l.add(e);
            }
        }
    }
    ArrayList<String> prefix;
    ArrayList<String> site;
    ArrayList<String> suffix;
    public GenerateClicks() {
        prefix = new ArrayList<String>();
        for(String p: prefixes) {
            prefix.add(p);
        }
        site = new ArrayList<String>();
        for(String p: sites) {
            site.add(p);
        }
        suffix = new ArrayList<String>();
        for(String p: suffixes) {
            suffix.add(p);
        }
        bias(prefix);
        bias(site);
        bias(suffix);
    }
    // Build a random URL from the (biased) prefix, site, and suffix lists.
    String generateURL() {
        int p = rand.nextInt(prefix.size());
        int m = rand.nextInt(site.size());
        int e = rand.nextInt(suffix.size());
        return "http://" + prefix.get(p) + "." + site.get(m) + "." + suffix.get(e);
    }
    @Override
    public void exec(Tuple in, DataBag out) throws IOException {
        // Emit a random number of (user, url) tuples for the user in the input tuple.
        int count = rand.nextInt(1000 + rand.nextInt(10000));
        for(int i = 0; i < count; i++) {
            Tuple t = new Tuple();
            t.appendField((DataAtom)in.getField(0));
            t.appendField(new DataAtom(generateURL()));
            out.add(t);
        }
    }

}

Okay, so you compiled it, but you are getting a class-not-found exception. Pig needs to be able to find your UDF class and ship it to Hadoop. We do this using register: create a jar file containing the class and use

register myjar.jar;

before trying to use the UDF.
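
If you have not built a UDF jar before, one way to do it (this sketch assumes pig.jar is in your current directory and GenerateClicks.java lives under pig/example/) is:

javac -cp pig.jar pig/example/GenerateClicks.java
jar cf myjar.jar pig/example/GenerateClicks.class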

Why do we need flatten? (To answer that question, try the Pig Latin above with and without flatten, and use describe to see the difference.)
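
For example, you can compare the two shapes and look at what describe prints for each (the alias clicks here is just a name chosen for illustration):

clicks = foreach users generate pig.example.GenerateClicks(*);
describe clicks;
pages = foreach users generate flatten(pig.example.GenerateClicks(*)) as (user, url);
describe pages;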

The only thing left is to store everything:

store pages into 'pages';
store users into 'users';
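
By default store writes tab-delimited text using PigStorage; if you would rather use a different field delimiter, you can name the store function explicitly (the output path here is just an example):

store pages into 'pages-colon' using PigStorage(':');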
