SimpleMapReduceTutorial

This is the simplest MapReduce example I could come up with: local filesystem, indexing just one segment. I am running Ubuntu on an Athlon 3200+ with a cable modem connection.

Designate URL

First, change to the mapred branch of the Nutch checkout:

cd nutch/branches/mapred

We need to make a directory containing one or more files, where each line of each file is a URL. I chose http://lucene.apache.org/nutch/

mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
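Since each line of each file in the directory is read as a URL, a quick check before crawling can catch blank or malformed lines. Here is a minimal sketch. The helper name check_seed_dir is made up for illustration (it is not part of Nutch), and it assumes only http:// and https:// URLs are wanted.

```shell
# check_seed_dir: hypothetical helper, not part of Nutch.
# Prints every line under the seed directory that does not look like an
# http:// or https:// URL and returns non-zero if any such line exists.
check_seed_dir() {
    dir="$1"
    # A missing directory is an error, not a pass.
    [ -d "$dir" ] || return 2
    # -r: search all files, -n: show line numbers, -v: invert the match.
    # grep succeeds when it finds a bad line, so a match means failure.
    if grep -rnvE '^https?://[^[:space:]]+$' "$dir"; then
        return 1
    fi
    return 0
}
```

After the mkdir and echo above, running check_seed_dir urls should print nothing and succeed.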

We also need to change the crawl filter to include this site:

perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
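To confirm the substitution took, one option is to check that the placeholder is gone and that the site now appears on an accept line. This is only a sketch: filter_is_configured is a made-up name, and it assumes accept rules in the filter file start with "+", which is how I remember the stock file looking.

```shell
# filter_is_configured: hypothetical helper, not part of Nutch.
# Succeeds when the MY.DOMAIN.NAME placeholder no longer appears in the
# filter file and the given site string appears on a line starting
# with "+" (assumed here to mark an accept rule).
filter_is_configured() {
    file="$1"
    site="$2"
    # -q: quiet, -F: fixed string, so the dots are not treated as regex.
    if grep -qF 'MY.DOMAIN.NAME' "$file"; then
        return 1
    fi
    grep -F -- "$site" "$file" | grep -q '^+'
}
```

For this tutorial the call would be filter_is_configured conf/crawl-urlfilter.txt lucene.apache.org/nutch.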

We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.

Crawl

We want to run crawl on the urls directory from above.

./bin/nutch crawl urls

Took me about ten minutes. Output included

051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s

The errors generally seemed to be timeouts.
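If you save the output to a file, the page and error counts can be pulled out of status lines like the one above. A small sketch, assuming the line keeps the "N pages, M errors," shape shown here; page_counts is a made-up helper name.

```shell
# page_counts: hypothetical helper. Reads log lines on stdin and, for
# each line containing "<n> pages, <m> errors,", prints the two counts.
page_counts() {
    awk '{
        pages = ""; errors = ""
        for (i = 2; i <= NF; i++) {
            # The number sits in the field just before each label.
            if ($i == "pages,")  pages = $(i - 1)
            if ($i == "errors,") errors = $(i - 1)
        }
        if (pages != "" && errors != "")
            print "pages=" pages, "errors=" errors
    }'
}

echo "051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s" | page_counts
# prints: pages=178 errors=17
```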

The rest of the commands are a bit more dynamic, relying on timestamp and the like. Environment variables help out.

Generate

Here we generate a new segment in the segments directory created by the crawl above.

CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
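The find commands above assume exactly one crawl-2* directory exists. If there are several, the variables end up holding more than one path. A minimal guard, as a sketch; is_single_dir is a made-up helper name.

```shell
# is_single_dir: hypothetical helper. Succeeds only if its argument is
# a single line that names an existing directory.
is_single_dir() {
    value="$1"
    [ -n "$value" ] || return 1
    # More than one line means find matched several directories.
    lines=$(printf '%s\n' "$value" | wc -l | tr -d ' ')
    [ "$lines" -eq 1 ] || return 1
    [ -d "$value" ]
}
```

One way to use it is to run is_single_dir "$CRAWLDB" and is_single_dir "$SEGMENTS_DIR" before calling generate, and stop if either fails.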

Took less than five seconds.

Fetch

SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
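The find and tail combination above picks the last segment in sorted order, which is the newest one when segment names are timestamps. The same idea can be sketched as a function that fails loudly when no segment exists. The name latest_segment is made up, and the timestamp naming is an assumption based on the 2* glob used here.

```shell
# latest_segment: hypothetical helper, not part of Nutch.
# Prints the last directory (in sorted glob order) under the given
# segments directory. With timestamp-style names, that is the newest.
latest_segment() {
    segments_dir="$1"
    latest=""
    for d in "$segments_dir"/*/; do
        # If the glob matched nothing, the pattern is left unexpanded.
        [ -d "$d" ] || continue
        latest="${d%/}"
    done
    if [ -z "$latest" ]; then
        echo "no segments found in $segments_dir" >&2
        return 1
    fi
    printf '%s\n' "$latest"
}
```

With that in place, SEGMENT=`latest_segment $SEGMENTS_DIR` is an alternative to the find line.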

Took about seven minutes, and output looked like

051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,

Again, many timeouts.

UpdateDB

./bin/nutch updatedb $CRAWLDB $SEGMENT

Took less than five seconds.

InvertLinks

LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch invertlinks $LINKDB $SEGMENTS

Took less than five seconds.

Index

We need a place for our index, say myindex.

mkdir myindex

Now, let's index.

./bin/nutch index myindex $LINKDB $SEGMENT

Took less than ten seconds.

Test

The best test I have for the moment is

ls -alR myindex

If you see several files, it at least did something. Happy nutching!
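The same eyeball test can be turned into a check that a script can act on. This is a rough sketch, and it only shows that something was written, not that the index is valid. The name index_looks_built is made up.

```shell
# index_looks_built: hypothetical helper. Succeeds if the directory
# contains at least one non-empty regular file anywhere beneath it.
index_looks_built() {
    dir="$1"
    [ -d "$dir" ] || return 1
    # -type f: regular files only, -size +0: larger than zero blocks.
    [ -n "$(find "$dir" -type f -size +0 | head -n 1)" ]
}
```

For example, index_looks_built myindex && echo "index has files".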

Tutorial written by Earl Cahill, 2005.

last edited 2005-10-04 17:12:07 by EarlCahill