PageRank

Uses the PageRank algorithm described in the Google Pregel paper
Introduces partitioning and collective communication
Lets you submit your own text file to calculate the PageRank of the sites it contains

Usage

bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]

The default parameters for pagerank are:

0.85 0.001

0.85 is the damping factor: the probability that a user follows one of the links on the current site. With the remaining probability of 0.15 the user "randomly" jumps to some other site. See the Random Surfer Model.

0.001 is the convergence error. The error is measured after every iteration and tells how much the PageRank of all sites has changed since the previous one; the calculation stops once the change falls below this value. Setting a lower value makes the calculation take more iterations.
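To make the role of the two parameters concrete, here is a minimal single-machine sketch of the iteration in Python. It is not Hama's BSP implementation: the function name `pagerank`, the `max_iterations` cap, the summed absolute change used as the error, and the even redistribution of rank held by dangling nodes are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, epsilon=0.001, max_iterations=100):
    """Return (ranks, iterations) for a non-empty graph.

    graph maps each site to the list of its outlinks; every outlink
    must itself be a key of graph.
    """
    n = len(graph)
    ranks = {site: 1.0 / n for site in graph}
    for iteration in range(1, max_iterations + 1):
        # Assumption: rank held by dangling nodes is spread evenly over all sites.
        dangling = sum(ranks[s] for s, out in graph.items() if not out)
        new_ranks = {s: (1.0 - damping) / n + damping * dangling / n for s in graph}
        for site, outlinks in graph.items():
            for target in outlinks:
                new_ranks[target] += damping * ranks[site] / len(outlinks)
        # Assumption: the error is the total absolute change over all sites.
        error = sum(abs(new_ranks[s] - ranks[s]) for s in graph)
        ranks = new_ranks
        if error < epsilon:
            break
    return ranks, iteration


if __name__ == "__main__":
    example = {"Site1": ["Site2", "Site3"], "Site2": ["Site3"], "Site3": []}
    ranks, iterations = pagerank(example)
    print(iterations, ranks)
```

Calling `pagerank` with a smaller `epsilon` on the same graph never needs fewer iterations, which is the trade-off described above.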

Submit your own Web-graph

You can transform your graph into an adjacency list that matches the input format Hama parses to calculate the PageRank.

The file that Hama can successfully parse is a TextFile that has the following layout:

Site1\tSite2\tSite3
Site2\tSite3
Site3

This piece of text links Site1 to Site2 and Site3, and Site2 to Site3; Site3 is a dangling node with no outlinks. As you can see, a site is always on the leftmost side (we call it the key-site), and its outlinks follow as the remaining elements, separated by tabs (\t).

Make sure that every site that appears as an outlink also appears somewhere in the file as a key-site. Otherwise the job will fail with confusing NullPointerExceptions.
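A quick check before submitting the file can catch this problem. The following Python sketch parses the tab-separated layout shown above and lists every outlink that never appears as a key-site. The helper names `parse_adjacency_list` and `missing_key_sites` are hypothetical and not part of Hama.

```python
def parse_adjacency_list(text):
    """Parse 'key-site<TAB>outlink<TAB>outlink...' lines into a dict."""
    graph = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        # Drop empty fields that a trailing tab would produce.
        graph[fields[0]] = [site for site in fields[1:] if site]
    return graph


def missing_key_sites(graph):
    """Return the sorted outlinks that never occur as a key-site."""
    targets = {t for outlinks in graph.values() for t in outlinks}
    return sorted(targets - set(graph))


if __name__ == "__main__":
    sample = "Site1\tSite2\tSite3\nSite2\tSite3\nSite3\n"
    print(missing_key_sites(parse_adjacency_list(sample)))
```

An empty result means every outlink has its own key-site line; any name that is listed needs a line of its own, even if that site has no outlinks.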

Now you need to transform the text file using:

bin/hama jar ../hama-0.4.0-examples.jar pagerank-text2seq /tmp/input.txt /tmp/out/

Then you can run pagerank on it with:

bin/hama jar ../hama-0.4.0-examples.jar pagerank /tmp/out /tmp/pagerank-output

Note that, depending on your configuration, the paths refer either to HDFS or to the local disk.

Output

The output for each site is a double value between 0.0 and 1.0; the closer the value is to 1.0, the more "famous" the site.

The ranks of all pages should sum to 1.0; otherwise the algorithm is broken.
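Because the ranks are floating-point values and the calculation stops at a convergence error, the sum will rarely be exactly 1.0, so a check should allow a small tolerance. The Python sketch below assumes the ranks have already been loaded into a dict of site to rank; reading them from the output files is left out because the output layout is not described here. The function name `ranks_sum_to_one` and the default tolerance are assumptions.

```python
import math


def ranks_sum_to_one(ranks, tolerance=1e-3):
    """Return True if the ranks add up to 1.0 within the given tolerance."""
    total = math.fsum(ranks.values())  # fsum limits accumulated rounding error
    return math.isclose(total, 1.0, abs_tol=tolerance)


if __name__ == "__main__":
    # Made-up values for illustration, not real Hama output.
    print(ranks_sum_to_one({"Site1": 0.2, "Site2": 0.3, "Site3": 0.5}))
```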

Sample Adjacencylist File

You can create a large pagerank input file by using the PagerankTeragen file from here: http://code.google.com/p/hama-shortest-paths/source/browse/trunk/hama-gsoc/src/de/jungblut/hama/util/PagerankTeragen.java

It is based on MapReduce and requires a running Hadoop cluster. You can create a file using

hadoop/bin/hadoop jar <jar containing the pagerank teragen> <number of vertices> <number of reducers / output files> <number of edges per vertex> <output path>
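For a small test file that does not need a Hadoop cluster, a few lines of Python are enough. This sketch is not PagerankTeragen and makes no claim to match its output: the function name `generate_adjacency_lines`, the `Site<i>` naming, and the `seed` parameter are assumptions. It gives every vertex its own key-site line, so the requirement that each outlink also occurs as a key-site holds by construction.

```python
import random


def generate_adjacency_lines(num_vertices, edges_per_vertex, seed=0):
    """Return tab-separated adjacency-list lines for a random graph."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    sites = [f"Site{i}" for i in range(1, num_vertices + 1)]
    lines = []
    for site in sites:
        # No self-links; builds a fresh candidate list, so suited to small graphs only.
        candidates = [s for s in sites if s != site]
        count = min(edges_per_vertex, len(candidates))
        outlinks = rng.sample(candidates, count)
        lines.append("\t".join([site] + outlinks))
    return lines


if __name__ == "__main__":
    print("\n".join(generate_adjacency_lines(5, 2)))
```

Writing the returned lines to a text file gives an input that can be passed to the pagerank-text2seq step described earlier.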

Have fun! If you run into problems, feel free to ask questions on the official mailing list.

Implementation

For details on the implementation, have a look at my blog post. It describes the algorithm and walks through the main ideas behind the implementation.

http://codingwiththomas.blogspot.com/2011/04/pagerank-with-apache-hama.html
