Recently, and with a bit of effort, I got db1.spack up and running on nutch trunk. I decided to keep track of what I did to get db2.spack up and running, and contribute this tutorial.
Here are some minimal steps:
Let's add a nutch user to do our nutch stuff
# adduser nutch
I tried to get Java from the normal apt sources, without success; my guess is that my Athlon was the problem. I gave up and got Java from Sun (http://java.sun.com/j2se/1.5.0/download.jsp), using the "Download JDK 5.0 Update 4" link. I tried 1.4.2 first and it didn't work, but 1.5.0 did.
root@db2:/opt# ./jdk-1_5_0_04-linux-amd64.bin
You might also want to follow the instructions for Debian-izing the Sun JDK: http://plugindoc.mozdev.org/faqs/distronotes/ubuntu-x86.html#java-sun
Let's put JAVA_HOME in ~/.bash_profile and source it, for both root and nutch
# echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
# . ~/.bash_profile
nutch@db2:~$ echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
nutch@db2:~$ . ~/.bash_profile
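Before moving on, it may be worth confirming that JAVA_HOME really points at the unpacked JDK. Here is a minimal sketch, assuming a POSIX shell; the function name check_java_home is made up for this example and is not part of nutch or the JDK:

```shell
# Hypothetical helper: succeeds only if the given directory contains
# an executable bin/java.
check_java_home() {
    dir=$1
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable java under $dir/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $dir"
}
```

Once the function is defined, running check_java_home "$JAVA_HOME" as both root and nutch should print the path if the profile was sourced correctly.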
Add the Multiverse to your sources.list or use the GUI:
System -> Administration -> Synaptic Package Manager
Settings -> Repositories
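If you edit sources.list by hand, a quick way to see whether Multiverse is already enabled is to look for it on an active "deb" line. This is a rough sketch, assuming the usual one-line "deb URL release components..." format; has_multiverse is a name invented for this example:

```shell
# Hypothetical helper: succeeds if any uncommented "deb" or "deb-src"
# line in the given file lists the multiverse component.
has_multiverse() {
    grep -Eq '^[[:space:]]*deb(-src)?[[:space:]].*[[:space:]]multiverse([[:space:]]|$)' "$1"
}
```

For example, has_multiverse /etc/apt/sources.list || echo "multiverse not enabled yet" prints the message when no active line mentions multiverse. Commented-out lines are ignored on purpose.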
With the new apt sources, let's update
# apt-get update
And get the packages we need.
# apt-get install subversion ant ant-optional lynx
subversion is used to get nutch, ant is used to build nutch and lynx is used to test nutch.
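To double-check that the install put everything on the PATH, something like the following could be used. It is only a sketch; missing_tools is a made-up name, and it assumes the packages provide the svn, ant and lynx commands:

```shell
# Hypothetical helper: print every command from the argument list that
# is not on PATH; return non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}
```

Calling missing_tools svn ant lynx prints nothing when all three are available, and otherwise lists the ones that still need installing.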
Let's change over to the nutch user
# su - nutch
Check out the nutch code AND the gora code
nutch@db2:~$ svn checkout http://svn.apache.org/repos/asf/nutch/trunk nutch
nutch@db2:~$ cd nutch
nutch@db2:~/nutch$ svn checkout https://svn.apache.org/repos/asf/incubator/gora/
Since this tutorial is for getting trunk to work, let's go there
nutch@db2:~ $ cd ~/nutch
We build with ant
nutch@db2:~/nutch $ ant
And build a war for tomcat and later searching
nutch@db2:~/nutch $ ant war
Follow the nutch tutorial (http://lucene.apache.org/nutch/tutorial.html) to build an index, or for a simple index:
If you are using the latest "trunk" stuff, the url seeding has been changed from a single file to a directory. Using trunk (after 0.7.2), put the urls in a file (here, called "nutch") in a DIRECTORY called "urls":
nutch@db2:~/nutch $ mkdir urls
nutch@db2:~/nutch $ echo 'http://lucene.apache.org/nutch/' > urls/nutch
Using 0.7.2 or before, just put urls in a FILE called "urls":
nutch@db2:~/nutch $ echo 'http://lucene.apache.org/nutch/' > urls
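The two seeding layouts above can be wrapped in one small helper, which may be handy if you switch between versions. This is just a sketch; seed_urls and its "dir"/"file" arguments are invented for this example and are not nutch options:

```shell
# Hypothetical helper. Usage: seed_urls LAYOUT URL
#   LAYOUT=dir   trunk layout: directory "urls" containing file "nutch"
#   LAYOUT=file  0.7.2-and-earlier layout: plain file "urls"
seed_urls() {
    layout=$1
    url=$2
    case $layout in
        dir)
            mkdir -p urls && echo "$url" > urls/nutch
            ;;
        file)
            echo "$url" > urls
            ;;
        *)
            echo "usage: seed_urls dir|file URL" >&2
            return 2
            ;;
    esac
}
```

Run from the nutch directory, seed_urls dir 'http://lucene.apache.org/nutch/' reproduces the trunk commands, and seed_urls file 'http://lucene.apache.org/nutch/' reproduces the 0.7.2 one.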
Then, in either case, you run the crawl in the same fashion ("urls" below referring either to a dir or a file, depending on the version you're using):
nutch@db2:~/nutch $ perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' \
    conf/crawl-urlfilter.txt
nutch@db2:~/nutch $ src/bin/nutch crawl urls -dir crawl.test -depth 3
See, perl can be useful
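If you would rather see what the substitution does before letting perl edit the file in place, the same pattern can be run through sed, which only prints the result. This is a sketch; preview_urlfilter is a made-up name, and sed is swapped in here purely for the dry run:

```shell
# Hypothetical helper. Usage: preview_urlfilter FILE DOMAIN
# Prints FILE with the MY.DOMAIN.NAME placeholder replaced by DOMAIN,
# leaving FILE itself untouched. DOMAIN must not contain "|" or "&",
# since it is pasted straight into the sed expression.
preview_urlfilter() {
    file=$1
    domain=$2
    sed "s|MY.DOMAIN.NAME|$domain|" "$file"
}
```

Assuming the stock filter line reads +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/, running preview_urlfilter conf/crawl-urlfilter.txt lucene.apache.org/nutch should show it as +^http://([a-z0-9]*\.)*lucene.apache.org/nutch/, which limits the crawl to URLs under that prefix.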
Again, I tried apt without much luck, so I downloaded tomcat from Apache (http://jakarta.apache.org/site/downloads/downloads_tomcat-4.cgi).
As above, I put the java stuff in /opt
root@db2:/opt# tar -xzvf jakarta-tomcat-4.1.31.tar.gz
Out with the old and in with the new
# rm -rf /opt/jakarta-tomcat-4.1.31/webapps/ROOT*
# cp ~nutch/nutch/build/nutch-0.8-dev.war \
    /opt/jakarta-tomcat-4.1.31/webapps/ROOT.war
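Since the rm -rf above throws away the stock ROOT webapp, a slightly more cautious variant checks that the war really exists first. This is a sketch only; deploy_root_war is a name invented for this example:

```shell
# Hypothetical helper. Usage: deploy_root_war WAR WEBAPPS_DIR
# Refuses to touch the existing ROOT webapp unless both the war file
# and the webapps directory exist.
deploy_root_war() {
    war=$1
    webapps=$2
    if [ ! -f "$war" ]; then
        echo "war not found: $war" >&2
        return 1
    fi
    if [ ! -d "$webapps" ]; then
        echo "webapps directory not found: $webapps" >&2
        return 1
    fi
    rm -rf "$webapps"/ROOT*
    cp "$war" "$webapps/ROOT.war"
}
```

It would be called as root with the path to the war you built and /opt/jakarta-tomcat-4.1.31/webapps as the second argument. If the build failed and the war is missing, the old ROOT stays in place.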
Let's move to where we put the index
# cd ~nutch/nutch/crawl.test
And start tomcat from there
# /opt/jakarta-tomcat-4.1.31/bin/catalina.sh start
Connect to tomcat and perform a search.
$ lynx localhost:8080
I searched for 'nutch' and all was well! (you can use <TAB> to get to the search input in lynx)
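Tomcat can take a few seconds to unpack the war, so the first connection attempt may fail. A small retry loop is one way to wait for it. This is a sketch; retry is a generic helper written for this example, and it assumes the command you give it exits non-zero while the server is not yet answering:

```shell
# Hypothetical helper. Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND (output discarded) until it succeeds or ATTEMPTS runs out.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 1
}
```

For instance, retry 10 2 lynx -dump http://localhost:8080/ tries up to ten times, two seconds apart, and returns success as soon as lynx manages to fetch the page.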
Tutorial written by Earl Cahill, 2005