Recently, and with a bit of effort, I got db1.spack up and running on nutch trunk. I decided to keep track of what I did to get db2.spack up and running, and contribute this tutorial.
Install Ubuntu
Here are some minimal steps:
got "The Hoary Hedgehog" from http://www.ubuntu.com/download/
entered 'server' on the install screen
the rest, I thought, was a breeze
I did run 'sudo passwd', which allowed me to do stuff as root, as below
Just a little plug for Ubuntu. I guess I have a funny setup: I built an Athlon 3200+ machine with onboard SATA drives that I wanted to RAID, and I wanted to run Java. Getting those few things working together took me a couple of months, off and on. Once I found Ubuntu, it took about a night, and Java took another day or two. Ubuntu was pretty much exactly what I was looking for: a stripped-down Debian that installs almost nothing by default and lets me apt-get install just about whatever I want when the need arises. It could probably install ssh by default, though.
As a side note, I just spent about five minutes trying these steps on a rather old box running Debian, and it didn't immediately work, though I will try again another day.
Add Nutch User
Let's add a nutch user to do our nutch stuff
# adduser nutch
java
I tried to get Java from the normal apt sources without luck, and I am guessing my Athlon was the problem. I broke down and got Java from Sun (http://java.sun.com/j2se/1.5.0/download.jsp), using the "Download JDK 5.0 Update 4" link. I tried 1.4.2 first and it didn't work, but 1.5.0 did.
root@db2:/opt# ./jdk-1_5_0_04-linux-amd64.bin
You might also want to follow the instructions for Debian-izing the Sun JDK:
http://plugindoc.mozdev.org/faqs/distronotes/ubuntu-x86.html#java-sun
Let's put JAVA_HOME in ~/.bash_profile for both root and nutch, and source the file for each user
# echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
# . ~/.bash_profile
nutch@db2:~$ echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
nutch@db2:~$ . ~/.bash_profile
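Before moving on, it can help to confirm that JAVA_HOME really points at a JDK. The following is a minimal sketch, not part of the original steps: check_java_home is a hypothetical helper that only tests for an executable bin/java under the directory it is given.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original tutorial):
# succeed only if the given directory contains an executable bin/java.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable java under $dir/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $dir"
}
```

Run it as check_java_home "$JAVA_HOME" for both root and nutch after sourcing ~/.bash_profile. A non-zero exit status means the variable is empty or points somewhere without a java binary.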
apt
I changed my /etc/apt/sources.list to include
deb http://ubuntu-backports.mirrormax.net/ hoary-backports main universe multiverse restricted
deb http://ubuntu-backports.mirrormax.net/ hoary-extras main universe multiverse restricted
deb http://us.archive.ubuntu.com/ubuntu hoary main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary main restricted
deb http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted
deb http://us.archive.ubuntu.com/ubuntu hoary universe
deb-src http://us.archive.ubuntu.com/ubuntu hoary universe
deb http://security.ubuntu.com/ubuntu hoary-security main restricted
deb-src http://security.ubuntu.com/ubuntu hoary-security main restricted
With the new apt sources, let's update
# apt-get update
And get the packages we need.
# apt-get install ssh subversion ant ant-optional lynx
ssh is just good to have, subversion is used to get nutch, ant is used to build nutch and lynx is used to test nutch.
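To double-check that the install above put everything on the PATH, something like the following could be used. It is only a sketch: missing_tools is a hypothetical helper, and it relies on nothing more than the standard command -v lookup.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original tutorial):
# print the name of every listed command that cannot be found on PATH,
# one per line. Prints nothing when all of them are present.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}
```

For this tutorial the call would be missing_tools ssh svn ant lynx (the subversion package provides the svn command). Empty output means all four were found.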
Build Nutch Code and Index
Let's change over to the nutch user
# su - nutch
Check out the code
nutch@db2:~$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/
Since this tutorial is for getting trunk to work, let's go there
nutch@db2:~ $ cd ~/nutch/trunk/
We build with ant
nutch@db2:~/nutch/trunk $ ant
And build a war for tomcat and later searching
nutch@db2:~/nutch/trunk $ ant war
Follow the nutch tutorial (http://lucene.apache.org/nutch/tutorial.html) to build an index, or for a simple index:
If you are using the latest "trunk" stuff, the url seeding has been changed from a single file to a directory. Using trunk (after 0.7.2), put the urls in a file (here, called "nutch") in a DIRECTORY called "urls":
nutch@db2:~/nutch/trunk $ mkdir urls
nutch@db2:~/nutch/trunk $ echo 'http://lucene.apache.org/nutch/' > urls/nutch
Using 0.7.2 or before, just put urls in a FILE called "urls":
nutch@db2:~/nutch/trunk $ echo 'http://lucene.apache.org/nutch/' > urls
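The two seeding layouts can be wrapped in one small function, which makes the difference easy to see side by side. This is a sketch only: seed_urls and its dir/file argument are hypothetical names, and it writes into the current directory just like the commands above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original tutorial).
# Usage: seed_urls LAYOUT URL
#   LAYOUT=dir  -> trunk layout: a directory "urls" containing a file "nutch"
#   LAYOUT=file -> 0.7.2-and-earlier layout: a plain file "urls"
seed_urls() {
    layout="$1"
    url="$2"
    case "$layout" in
        dir)
            mkdir -p urls && echo "$url" > urls/nutch
            ;;
        file)
            echo "$url" > urls
            ;;
        *)
            echo "usage: seed_urls dir|file URL" >&2
            return 1
            ;;
    esac
}
```

From ~/nutch/trunk, seed_urls dir 'http://lucene.apache.org/nutch/' reproduces the trunk layout and seed_urls file 'http://lucene.apache.org/nutch/' reproduces the older one.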
Then, in either case, set your domain in the URL filter and run the crawl the same way ("urls" below refers to either a directory or a file, depending on the version you're using):
nutch@db2:~/nutch/trunk $ perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' \
  conf/crawl-urlfilter.txt
nutch@db2:~/nutch/trunk $ bin/nutch crawl urls -dir crawl.test -depth 3
See, perl can be useful
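If perl is not at hand, the same placeholder substitution can be sketched with sed and a temporary file. This swaps in sed for the perl -pi one-liner above; set_crawl_domain is a hypothetical helper name, and the sample filter line used to try it out is an assumption about what conf/crawl-urlfilter.txt contains.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original tutorial):
# replace the MY.DOMAIN.NAME placeholder in a filter file.
# Uses sed plus a temporary file instead of perl -pi. As in the perl version,
# the dots in the pattern match any character. The domain must not contain
# '|' or '&', which are special in this sed expression.
set_crawl_domain() {
    filter_file="$1"
    domain="$2"
    sed "s|MY.DOMAIN.NAME|$domain|" "$filter_file" > "$filter_file.new" &&
        mv "$filter_file.new" "$filter_file"
}
```

To try it safely, write a line such as +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ into a scratch file and run set_crawl_domain scratch.txt lucene.apache.org/nutch before pointing it at the real conf/crawl-urlfilter.txt.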
tomcat
Again, I tried apt without much luck, so I downloaded tomcat from Apache (http://jakarta.apache.org/site/downloads/downloads_tomcat-4.cgi).
As above, I put the java stuff in /opt
root@db2:/opt# tar -xzvf jakarta-tomcat-4.1.31.tar.gz
Out with the old and in with the new
# rm -rf /opt/jakarta-tomcat-4.1.31/webapps/ROOT*
# cp ~nutch/nutch/trunk/build/nutch-0.8-dev.war \
/opt/jakarta-tomcat-4.1.31/webapps/ROOT.war
Let's move to where we put the index
# cd ~nutch/nutch/trunk/crawl.test
And start tomcat from there
# /opt/jakarta-tomcat-4.1.31/bin/catalina.sh start
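Since tomcat is started from the crawl directory, a quick sanity check of that directory can save a confusing empty search page. The sketch below assumes the crawl wrote "index" and "segments" subdirectories under crawl.test; that layout is an assumption and may differ between nutch versions, and check_crawl_dir is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original tutorial):
# succeed only if the crawl directory contains the subdirectories we
# assume the search webapp needs ("index" and "segments").
check_crawl_dir() {
    crawl_dir="$1"
    for sub in index segments; do
        if [ ! -d "$crawl_dir/$sub" ]; then
            echo "missing $crawl_dir/$sub" >&2
            return 1
        fi
    done
}
```

Running check_crawl_dir ~nutch/nutch/trunk/crawl.test before catalina.sh start reports the first missing subdirectory, if any, on stderr.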
Test
Connect to tomcat and perform a search.
$ lynx localhost:8080
I searched for 'nutch' and all was well! (you can use <TAB> to get to the search input in lynx)
Tutorial written by Earl Cahill, 2005