created by sjw/mfgis 2feb07

Installing and Running Nutch Under Debian 'Etch'

Install Sun's Java

Sun Java is available as a set of Debian packages and may be easily installed using apt.
To obtain Sun's Java, ensure that 'non-free' is included in /etc/apt/sources.list, then install the packages:
# apt-get install sun-java5-bin sun-java5-demo sun-java5-jdk sun-java5-jre
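
If apt cannot find the sun-java5 packages, the usual cause is a missing 'non-free' component. The check can be scripted roughly as follows. This is only a sketch: the helper name has_non_free is my own invention, and it only looks at the file you pass in, not at any files under /etc/apt/sources.list.d.

```shell
# Hypothetical helper (not part of any Debian tool): succeeds if FILE
# contains an active (uncommented) "deb" line that lists non-free.
has_non_free() {
    grep -Eq '^[[:space:]]*deb[[:space:]].*[[:space:]]non-free([[:space:]]|$)' "$1"
}
```

For example, `has_non_free /etc/apt/sources.list || echo "add non-free and run apt-get update"`.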

Since there may be more than one flavor of Java on the system (e.g. kaffe), ensure that Sun Java is the chosen alternative by selecting it from the menu that this command presents:
# update-alternatives --config java

If necessary edit /etc/profile to include the following lines:

JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.10

export JAVA_HOME
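
The exact directory name depends on the installed Java update, so it is worth confirming that JAVA_HOME points at a real JVM before moving on. A minimal sketch, assuming only that a usable Java home contains an executable bin/java (the helper name check_java_home is hypothetical):

```shell
# Hypothetical helper: succeeds if DIR is non-empty and contains
# an executable bin/java.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}
```

For example, `check_java_home "$JAVA_HOME" || echo "JAVA_HOME does not point at a Java installation"`.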

Install Tomcat5.5 and Verify that it is functioning

# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps

Verify Tomcat is running:

# /etc/init.d/tomcat5.5 status

#Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid

Tomcat may be started and/or stopped using the following:

# /etc/init.d/tomcat5.5 start

# /etc/init.d/tomcat5.5 stop

It is NOT necessary to run '~/local/tomcat/bin/catalina.sh start' as noted elsewhere in the wiki, nor is it necessary to start Tomcat/Catalina from any particular location.

Tomcat5.5 under Debian Etch listens on port 8180, not 8080, so pointing your browser to http://blahblah:8180 (where blahblah stands for your server's hostname) will bring up the Tomcat home page if everything is functioning properly.

Grant Yourself Tomcat Manager Permissions

Edit /usr/share/tomcat5.5/conf/tomcat-users.xml and include the following:

<user username="myname" password="mypassword" roles="manager"/>

Enter the Tomcat Manager

Tomcat5.5 under Debian Etch comes pre-installed with a handful of simple webapps. Clicking on the Tomcat Manager link from the Tomcat home page will show you a list of these applications and their execution status. Later we will return to this page to verify that our Nutch applications are running.

Acquire, install and configure Nutch

Acquire a copy of Nutch and unpack it in a new directory. I suggest using /usr/local/nutch as the top-level directory, but this is of course optional.

Configure for multiple, independent site crawls and searches

Follow the section Intranet:Configuration from the Nutch tutorial at http://lucene.apache.org/nutch/tutorial8.html. However, plan in advance for crawling and searching sites independently from one another:

Given two sites, site1 and site2, which you wish to crawl/index (and later search) independently of each other, you may make multiple copies of the conf directory:

#cd /usr/local/nutch

#cp -rp conf conf.site1

#cp -rp conf conf.site2

And then work through steps one through four of the above mentioned section for each site.
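
With more than two sites the copying gets tedious, so it can be wrapped in a small function. This is a sketch only; make_site_confs is a hypothetical name, and I have chosen to skip sites whose conf copy already exists so that a configuration you have already edited is not overwritten.

```shell
# Hypothetical helper: copy NUTCH_DIR/conf to NUTCH_DIR/conf.<site>
# for every site named on the command line. Existing copies are left
# alone so that earlier edits survive a second run.
make_site_confs() {
    nutch_dir=$1
    shift
    for site in "$@"; do
        if [ -e "$nutch_dir/conf.$site" ]; then
            echo "conf.$site already exists, skipping"
        else
            cp -rp "$nutch_dir/conf" "$nutch_dir/conf.$site"
        fi
    done
}
```

For example, `make_site_confs /usr/local/nutch site1 site2`.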

Create simple shell scripts which allow for the independent crawling of each site, such as /usr/local/nutch/crawl_site1.sh. The paths in the script are relative, so run it from /usr/local/nutch:

NUTCH_CONF_DIR=conf.site1

export NUTCH_CONF_DIR

bin/nutch crawl urls/site1 -dir crawls/site1 -depth 10 -topN 100000

and the same for site2.
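
Instead of one script per site, a single function that takes the site name could be used. The sketch below keeps the same crawl command and options as above. The variables NUTCH_DIR and NUTCH_BIN are my own assumptions, not something Nutch defines: NUTCH_DIR defaults to the install location suggested earlier, and NUTCH_BIN exists mainly so that the wrapper can be dry-run with a stand-in command.

```shell
# Hypothetical wrapper around the crawl command shown above.
# NUTCH_DIR and NUTCH_BIN are illustrative variables, not Nutch settings.
crawl_site() {
    site=$1
    nutch_dir=${NUTCH_DIR:-/usr/local/nutch}
    nutch_bin=${NUTCH_BIN:-bin/nutch}
    (
        # Run in a subshell so the caller's working directory and
        # environment are left untouched.
        cd "$nutch_dir" || exit 1
        NUTCH_CONF_DIR="conf.$site"
        export NUTCH_CONF_DIR
        "$nutch_bin" crawl "urls/$site" -dir "crawls/$site" -depth 10 -topN 100000
    )
}
```

For example, `crawl_site site1 && crawl_site site2` crawls both sites in turn and stops if the first crawl fails.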

Then proceed to crawl each site:

#sh crawl_site1.sh

#sh crawl_site2.sh

Configure Tomcat's File and Webapp Paths

Under Debian Etch, the Catalina policy files are located under /etc/tomcat5.5/policy.d. At runtime they are combined into a single file, /usr/share/tomcat5.5/conf/catalina.policy. Do not edit the latter, as it will be overwritten.

At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:

grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
    permission java.util.PropertyPermission "user.dir", "read";
    permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
    permission java.util.PropertyPermission "org.apache.*", "read";
    permission java.io.FilePermission "/usr/local/nutch/crawls/-" , "read";
    permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
    permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";
    permission java.lang.RuntimePermission "createClassLoader", "";
    permission java.security.AllPermission;
    };

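Warning: The last permission line (java.security.AllPermission) was necessary in order to make things work for me. It grants every permission to all code under the webapps directory, which makes the lines before it redundant and leaves these webapps effectively unrestricted by the security manager. If anybody can supply a more restrictive permission set, please do so!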

Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching

Under Debian Etch & Tomcat5.5 the webapps path is located at

/usr/share/tomcat5.5-webapps

Contrary to the Nutch tutorial(s) it is NOT NECESSARY to remove the ROOT context nor is it desirable. It was noted above that the Tomcat Manager allows us to view and control our multiple applications. Removing ROOT would break this functionality.

Create two new folders under /usr/share/tomcat5.5-webapps, and explode the nutch war file into each:

 
#cd /usr/share/tomcat5.5-webapps
#mkdir site1
#mkdir site2
#cp /usr/local/nutch/nutch-0.8.1.war site1
#cp /usr/local/nutch/nutch-0.8.1.war site2
#cd site1; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..
#cd site2; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd ..

Configure the site1,site2 webapps

Open site1/WEB-INF/classes/nutch-default.xml, change the searcher.dir parameter so that it points back to your crawl directory under /usr/local/nutch, and save the result as nutch-site.xml:

<name>searcher.dir</name>
<value>/usr/local/nutch/crawls/site1</value>

And repeat for site2.
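
A wrong searcher.dir is an easy mistake to make and tends to show up only as empty search results, so a quick check of each webapp's setting may save some time. The sketch below assumes the name and value elements sit on consecutive lines, as they do in the snippet above; searcher_dir is a hypothetical helper name.

```shell
# Hypothetical helper: print the <value> that follows
# <name>searcher.dir</name> in the given Nutch configuration file.
# Assumes the two elements are on separate, consecutive lines.
searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$1"
}
```

For example, `[ -d "$(searcher_dir site1/WEB-INF/classes/nutch-site.xml)" ] || echo "site1: crawl directory not found"`.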

Create site1.xml and site2.xml under /usr/share/tomcat5.5-webapps by modifying the distribution nutch-site.xml

<Context path="/site1" docBase="/usr/share/tomcat5.5-webapps/site1"
   debug="0" privileged="true" allowLinking="true">
</Context>

And repeat for site2.
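
Since the two context files differ only in the site name, they can also be generated. This is a sketch under the assumption that every site uses the same Context attributes as the site1 example above; write_context is a hypothetical helper name.

```shell
# Hypothetical helper: write <site>.xml into the webapps directory,
# using the same Context attributes as the site1 example.
write_context() {
    webapps=$1
    site=$2
    cat > "$webapps/$site.xml" <<EOF
<Context path="/$site" docBase="$webapps/$site"
   debug="0" privileged="true" allowLinking="true">
</Context>
EOF
}
```

For example, `write_context /usr/share/tomcat5.5-webapps site2`.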

Create symbolic links to these files under /usr/share/tomcat5.5/conf/Catalina/localhost

ln -s /usr/share/tomcat5.5-webapps/site1.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site1.xml
ln -s /usr/share/tomcat5.5-webapps/site2.xml /usr/share/tomcat5.5/conf/Catalina/localhost/site2.xml
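
The same links can be created in a loop. This is a sketch; link_contexts is a hypothetical helper name, and I use ln -sf rather than ln -s so that re-running it replaces an existing link instead of failing.

```shell
# Hypothetical helper: link each <site>.xml from the webapps directory
# into Tomcat's context directory. ln -sf replaces an existing link,
# so the function can be re-run safely.
link_contexts() {
    webapps=$1
    ctxdir=$2
    shift 2
    for site in "$@"; do
        ln -sf "$webapps/$site.xml" "$ctxdir/$site.xml"
    done
}
```

For example, `link_contexts /usr/share/tomcat5.5-webapps /usr/share/tomcat5.5/conf/Catalina/localhost site1 site2`.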

Restart Tomcat

# /etc/init.d/tomcat5.5 restart

Revisit the Tomcat Manager. You should see new entries for site1 and site2 and, with luck, their Running status should show as True.

Search Your Sites!

Point your browser to http://blahblah:8180/site1 and conduct a search.

Point your browser to http://blahblah:8180/site2 and conduct another search.

If everything was configured properly you should see independent results representing independent searches on independent crawls.

FIN.
