This is the simplest MapReduce example I could come up with: everything runs on the local filesystem, and the goal is just to get one segment indexed. I am running Ubuntu on an Athlon 3200+ with a cable modem connection.
Designate URL
First we need to get to the right place
cd nutch/branches/mapred
We need to make a directory that contains one or more files, where each line of each file is a URL. I chose
http://lucene.apache.org/nutch/
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
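If you want more than one seed, the one-URL-per-line layout described above could be produced with a small helper. This is only a sketch: `write_seeds` is a hypothetical name of my own, not part of Nutch, and it assumes a POSIX shell.

```shell
# write_seeds DIR URL... : hypothetical helper, not part of Nutch.
# Creates DIR if needed and writes each URL on its own line to DIR/urls.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/urls"
}
```

Calling `write_seeds urls "http://lucene.apache.org/nutch/"` should leave the same single-line urls/urls file as the two commands above.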
We also need to change the crawl URL filter to include this site
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
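Since the perl one-liner edits the file in place and prints nothing, it can be worth checking that the substitution actually happened. A minimal sketch, assuming the placeholder text is exactly MY.DOMAIN.NAME as in the command above; `filter_updated` is a hypothetical helper name, not a Nutch command.

```shell
# filter_updated FILE DOMAIN : hypothetical helper, not part of Nutch.
# Succeeds only if FILE mentions DOMAIN and no longer contains the
# MY.DOMAIN.NAME placeholder.
filter_updated() {
    file=$1
    domain=$2
    # -F treats the pattern as a fixed string, so the dots are literal.
    grep -q -F -- "$domain" "$file" || return 1
    if grep -q -F -- "MY.DOMAIN.NAME" "$file"; then
        return 1
    fi
    return 0
}
```

For example: `filter_updated conf/crawl-urlfilter.txt lucene.apache.org/nutch && echo "filter updated"`.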
We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.
Crawl
We want to run crawl on the urls directory from above.
./bin/nutch crawl urls
This took me about ten minutes. The output included
051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
The errors generally seemed to be timeouts.
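To compare runs, the page and error counts can be pulled out of a status line like the one above. This is a rough sketch that assumes the line keeps the "N pages, M errors," wording shown here; `status_counts` is a hypothetical helper, not part of Nutch.

```shell
# status_counts : hypothetical helper, not part of Nutch.
# Reads status lines on stdin and prints the number that precedes
# the words "pages," and "errors," on each line.
status_counts() {
    awk '{
        pages = ""; errors = ""
        for (i = 2; i <= NF; i++) {
            if ($i == "pages,")  pages = $(i - 1)
            if ($i == "errors,") errors = $(i - 1)
        }
        print "pages=" pages " errors=" errors
    }'
}
```

For example, `echo "051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s" | status_counts` prints `pages=178 errors=17`.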
The rest of the commands are a bit more dynamic, since they rely on timestamped directory names and the like. Environment variables help out.
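One way to fill such variables is to pick the newest directory from a set of candidates. The sketch below assumes the directory names are timestamps in a fixed-width format, so that plain string sorting matches chronological order; `latest_dir` is a hypothetical helper of mine, not a Nutch command.

```shell
# latest_dir PATH... : hypothetical helper, not part of Nutch.
# Prints the argument that sorts last among those that are directories.
# Assumes fixed-width timestamp names, so string order is time order.
latest_dir() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done | sort | tail -n 1
}
```

For example, `SEGMENT=$(latest_dir crawl-2*/segments/2*)` would select the most recent segment under that assumption.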
Generate
Here we generate a new segment under the segments directory from the crawl above.
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
Took less than five seconds.
Fetch
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
Took about seven minutes, and output looked like
051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,
Again, many timeouts.
UpdateDB
./bin/nutch updatedb $CRAWLDB $SEGMENT
Took less than five seconds.
InvertLinks
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch invertlinks $LINKDB $SEGMENTS
Took less than five seconds.
Index
We need a place for our index, say myindex
mkdir myindex
Now, let's index.
./bin/nutch index myindex $LINKDB $SEGMENT
Took less than ten seconds.
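The timings in this walkthrough were noted by hand. As a rough sketch, a wrapper like the following could report elapsed wall-clock seconds for any step; `timed` is a hypothetical helper, not part of Nutch, and it assumes a `date` that supports `+%s`.

```shell
# timed COMMAND [ARGS...] : hypothetical helper, not part of Nutch.
# Runs the command, reports elapsed whole seconds on stderr, and
# returns the command's own exit status.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$1 took $((end - start))s (exit $status)" >&2
    return $status
}
```

For example: `timed ./bin/nutch index myindex $LINKDB $SEGMENT`.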
Test
The best test I have for the moment is
ls -alR myindex
If you see several files, it at least did something. Happy nutching!
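The same check can be scripted. This is only a sketch of the "several files" test above, not a real validation of the index contents; `index_file_count` is a hypothetical helper name.

```shell
# index_file_count DIR : hypothetical helper, not part of Nutch.
# Prints the number of regular files found anywhere under DIR.
index_file_count() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

For example: `[ "$(index_file_count myindex)" -gt 0 ] && echo "myindex has files"`.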
Tutorial written by Earl Cahill, 2005.