GettingNutchRunningWithUbuntu

Recently, and with a bit of effort, I got db1.spack up and running on nutch trunk. I decided to keep track of what I did to get db2.spack up and running, and contribute this tutorial.

Install Ubuntu

Here are some minimal steps:

Just a little plug for Ubuntu. I have a slightly unusual setup: I built an Athlon 3200+ machine with onboard SATA drives that I wanted to RAID, and I wanted to run Java. Getting those few things working together took me a couple of months, off and on. Once I found Ubuntu, it took about a night; Java took another day or two. Ubuntu was pretty much exactly what I was looking for: a stripped-down Debian that installs almost nothing by default and lets me apt-get install whatever I want when the need arises. It could probably install ssh by default, though.

As a side note, I spent about five minutes trying these steps on a rather old box running Debian, and they didn't work right away, though I will try again another day.

Add Nutch User

Let's add a nutch user to do our nutch stuff

# adduser nutch

java

root@db2:/opt# ./jdk-1_5_0_04-linux-amd64.bin

You might also want to follow the instructions for Debian-izing the Sun JDK: http://plugindoc.mozdev.org/faqs/distronotes/ubuntu-x86.html#java-sun

Let's put JAVA_HOME in ~/.bash_profile for both root and nutch, and source the file for each user

# echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
# . ~/.bash_profile
nutch@db2:~$ echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
nutch@db2:~$ . ~/.bash_profile
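
The echo commands above append a new line every time they are run, so repeating the step leaves duplicate exports in the profile. Here is a minimal sketch of an append that is safe to repeat. The function name add_java_home is made up for this example; the JDK path is the one used above.

```shell
# Append "export JAVA_HOME=<jdk_dir>" to a profile file unless that exact
# line is already present. add_java_home is a hypothetical helper name.
add_java_home() {
    profile=$1
    jdk_dir=$2
    line="export JAVA_HOME=$jdk_dir"
    # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
    if ! grep -qxF -e "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

Run it once as root and once as nutch, for example add_java_home ~/.bash_profile /opt/jdk1.5.0_04, and then source the profile as before.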

apt

I changed my /etc/apt/sources.list to include

deb http://ubuntu-backports.mirrormax.net/ hoary-backports main universe multiverse restricted
deb http://ubuntu-backports.mirrormax.net/ hoary-extras main universe multiverse restricted

deb http://us.archive.ubuntu.com/ubuntu hoary main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary main restricted

deb http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted

deb http://us.archive.ubuntu.com/ubuntu hoary universe
deb-src http://us.archive.ubuntu.com/ubuntu hoary universe

deb http://security.ubuntu.com/ubuntu hoary-security main restricted
deb-src http://security.ubuntu.com/ubuntu hoary-security main restricted

With the new apt sources, let's update

# apt-get update

And get the packages we need.

# apt-get install ssh subversion ant ant-optional lynx

ssh is just good to have; subversion is used to get nutch, ant is used to build nutch, and lynx is used to test nutch.
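
Before moving on, it can help to confirm that the tools actually landed on the PATH. This is a small sketch, not part of the original steps. The helper name missing_tools is hypothetical, and it assumes the subversion package provides a command called svn.

```shell
# Print each named command that cannot be found on PATH, one per line.
# Returns 0 when everything is present, 1 otherwise.
# missing_tools is a hypothetical helper name.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return $status
}

# Example call for this tutorial (command names assumed from the packages):
#   missing_tools ssh svn ant lynx
```

If the call prints nothing, all four commands were found.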

Build Nutch Code and Index

Let's change over to the nutch user

# su - nutch

Check out the code

nutch@db2:~$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/

Since this tutorial is for getting trunk to work, let's go there

nutch@db2:~ $ cd ~/nutch/trunk/

We build with ant

nutch@db2:~/nutch/trunk $ ant

And build a war for tomcat and later searching

nutch@db2:~/nutch/trunk $ ant war

Follow the nutch tutorial (http://lucene.apache.org/nutch/tutorial.html) to build an index, or, for a simple index, use the steps below.

In the latest trunk, URL seeding has changed from a single file to a directory. Using trunk (after 0.7.2), put the URLs in a file (here called "nutch") inside a DIRECTORY called "urls":

nutch@db2:~/nutch/trunk $ mkdir urls
nutch@db2:~/nutch/trunk $ echo 'http://lucene.apache.org/nutch/' > urls/nutch

Using 0.7.2 or before, just put urls in a FILE called "urls":

nutch@db2:~/nutch/trunk $ echo 'http://lucene.apache.org/nutch/' > urls
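
The two seeding layouts above can be wrapped in one helper so the version difference is explicit. This is only a sketch: write_seed and its "dir"/"file" argument are names invented for this example, and it writes relative to the current directory, just like the commands above.

```shell
# write_seed <layout> <url>   (hypothetical helper)
#   layout "dir"  : trunk after 0.7.2, writes urls/nutch inside a urls directory
#   layout "file" : 0.7.2 or before, writes a flat file named urls
write_seed() {
    layout=$1
    url=$2
    case $layout in
        dir)
            mkdir -p urls && printf '%s\n' "$url" > urls/nutch
            ;;
        file)
            printf '%s\n' "$url" > urls
            ;;
        *)
            echo "usage: write_seed dir|file URL" >&2
            return 2
            ;;
    esac
}
```

From ~/nutch/trunk, write_seed dir 'http://lucene.apache.org/nutch/' gives the trunk layout and write_seed file 'http://lucene.apache.org/nutch/' gives the older one.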

Then, in either case, you pass "urls" to the crawl command in the same way (it refers to either a directory or a file, depending on the version you're using):

nutch@db2:~/nutch/trunk $ perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' \
  conf/crawl-urlfilter.txt
nutch@db2:~/nutch/trunk $ bin/nutch crawl urls -dir crawl.test -depth 3

See, perl can be useful :)
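
To see what the substitution does without touching conf/crawl-urlfilter.txt, you can run the same pattern over a single line and print the result. This sketch uses sed in place of perl -pi. The sample line is an assumption about what the placeholder line in that file looks like, and rewrite_filter_line is a name made up for the example.

```shell
# Sample filter line. ASSUMPTION: it resembles the placeholder line shipped
# in conf/crawl-urlfilter.txt; it was not copied from the source tree.
sample='+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/'

# Apply the same substitution as the perl one-liner, but print the result
# instead of editing a file in place. Hypothetical helper name.
rewrite_filter_line() {
    printf '%s\n' "$1" | sed 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|'
}

rewrite_filter_line "$sample"
```

Note that the dots in MY.DOMAIN.NAME are regular expression wildcards in both the perl and sed versions, which is harmless here because the placeholder is the only thing they need to match.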

tomcat

As above, I put the java stuff in /opt

root@db2:/opt# tar -xzvf jakarta-tomcat-4.1.31.tar.gz

Out with the old and in with the new

# rm -rf /opt/jakarta-tomcat-4.1.31/webapps/ROOT*
# cp ~nutch/nutch/trunk/build/nutch-0.8-dev.war \
    /opt/jakarta-tomcat-4.1.31/webapps/ROOT.war

Let's move to where we put the index

# cd ~nutch/nutch/trunk/crawl.test

And start tomcat from there

# /opt/jakarta-tomcat-4.1.31/bin/catalina.sh start
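
The tomcat steps above depend on two things already being in place: the war produced by ant war and the crawl.test directory produced by the crawl. A quick check before copying and starting can save some head scratching. This is a sketch only, and preflight is a hypothetical helper name.

```shell
# preflight <war_file> <crawl_dir>   (hypothetical helper)
# Succeeds only if the war is a regular file and the crawl directory exists;
# otherwise it reports what is missing on stderr and returns 1.
preflight() {
    war=$1
    crawl_dir=$2
    ok=0
    [ -f "$war" ] || { echo "missing war: $war" >&2; ok=1; }
    [ -d "$crawl_dir" ] || { echo "missing crawl dir: $crawl_dir" >&2; ok=1; }
    return $ok
}
```

With the paths used in this tutorial, that would be preflight ~nutch/nutch/trunk/build/nutch-0.8-dev.war ~nutch/nutch/trunk/crawl.test, run as root before the cp and catalina.sh steps.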

Test

Connect to tomcat and perform a search.

$ lynx localhost:8080

I searched for 'nutch' and all was well! (you can use <TAB> to get to the search input in lynx)

Tutorial written by Earl Cahill, 2005

last edited 2007-06-03 17:54:33 by CodeDemon