Recently, and with a bit of effort, I got db1.spack up and running on nutch trunk. I decided to keep track of what I did to get db2.spack up and running, and contribute this tutorial.

Install Ubuntu

Here are some minimal steps:

Add Nutch User

Let's add a nutch user to do our nutch stuff

# adduser nutch

java

I tried to get java from normal apt sources and I am guessing it is my Athlon that broke me. I broke down and got java from Sun (http://java.sun.com/j2se/1.5.0/download.jsp), the Download JDK 5.0 Update 4 link. I tried getting the 1.4.2 and it didn't work, but 1.5.0 worked.

root@db2:/opt# ./jdk-1_5_0_04-linux-amd64.bin

You might also want to follow the instructions for Debian-izing the Sun JDK: http://plugindoc.mozdev.org/faqs/distronotes/ubuntu-x86.html#java-sun

Let's put JAVA_HOME in our ~/.bash_profiles, and source said ~/.bash_profiles for root and nutch

# echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
# . ~/.bash_profile
nutch@db2:~$ echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
nutch@db2:~$ . ~/.bash_profile

apt

Add the Multiverse to your sources.list or use the GUI:

System -> Administration -> Synaptic Package Manager

Settings -> Repositories

With the new apt sources, let's update

# apt-get update

And get the packages we need.

# apt-get install subversion ant ant-optional lynx

subversion is used to get nutch, ant is used to build nutch and lynx is used to test nutch.

Build Nutch Code and Index

Let's change over to the nutch user

# su - nutch

Checkout the code AND the gora code

nutch@db2:~$ svn checkout http://svn.apache.org/repos/asf/nutch/trunk nutch
nutch@db2:~$ cd nutch
nutch@db2:~$ svn checkout https://svn.apache.org/repos/asf/incubator/gora/

Since this tutorial is for getting trunk to work, let's go there

nutch@db2:~ $ cd ~/nutch

We build with ant

nutch@db2:~/nutch $ ant

And build a war for tomcat and later searching

nutch@db2:~/nutch/trunk $ ant war

Follow the nutch tutorial (http://lucene.apache.org/nutch/tutorial.html) to build a index, or for a simple index:

If you are using the latest "trunk" stuff, the url seeding has been changed from a single file to a directory. Using trunk (after 0.7.2), put the urls in a file (here, called "nutch") in a DIRECTORY called "urls":

nutch@db2:~/nutch $ mkdir urls
nutch@db2:~/nutch $ echo 'http://lucene.apache.org/nutch/' > urls/nutch

Using 0.7.2 or before, just put urls in a FILE called "urls":

nutch@db2:~/nutch $ echo 'http://lucene.apache.org/nutch/' > urls

Then, in any case, you specify in the same fashion ("urls" below referring either to a dir or a file, depending on the version you're using):

nutch@db2:~/nutch $ perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' \
  conf/crawl-urlfilter.txt
nutch@db2:~/nutch $ src/bin/nutch crawl urls -dir crawl.test -depth 3

See, perl can be useful (smile)

tomcat

Again, I tried apt without much luck, so I downloaded tomcat from Apache (http://jakarta.apache.org/site/downloads/downloads_tomcat-4.cgi).

As above, I put the java stuff in /opt

root@db2:/opt# tar -xzvf jakarta-tomcat-4.1.31.tar.gz

Out with the old and in with the new

# rm -rf /opt/jakarta-tomcat-4.1.31/webapps/ROOT*
# cp ~nutch/nutch/trunk/build/nutch-0.8-dev.war \
    /opt/jakarta-tomcat-4.1.31/webapps/ROOT.war

Let's move to where we put the index

# cd ~nutch/nutch/trunk/crawl.test

And start tomcat from there

# /opt/jakarta-tomcat-4.1.31/bin/catalina.sh start

Test

Connect to tomcat and perform a search.

$ lynx localhost:8080

I searched for 'nutch' and all was well! (you can use <TAB> to get to the search input in lynx)

Tutorial written by Earl Cahill, 2005

  • No labels