Running Nutch with Mac OSX

Downloading and setting up Tomcat

Download Tomcat (http://tomcat.apache.org/). The latest versions require J2SE 1.5, which can be downloaded from www.apple.com (Tiger users only). I downloaded apache-tomcat-5.5.12.tar.gz.

Open a Terminal window, copy the file to /usr/local, and unpack it there (you may need sudo to write to /usr/local):

cp apache-tomcat-5.5.12.tar.gz /usr/local
cd /usr/local
tar -zxvf apache-tomcat-5.5.12.tar.gz

Then set JAVA_HOME and start Tomcat:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home   # or: export JAVA_HOME=/usr
/usr/local/apache-tomcat-5.5.12/bin/startup.sh

You will see something like:

Using CATALINA_BASE:   /usr/local/apache-tomcat-5.5.12
Using CATALINA_HOME:   /usr/local/apache-tomcat-5.5.12
Using CATALINA_TMPDIR: /usr/local/apache-tomcat-5.5.12/temp
Using JRE_HOME:       /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home
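
If you are not sure which JAVA_HOME applies on your machine, the choice between the framework path and /usr can be scripted. The following is a minimal sketch, not part of Tomcat: pick_java_home is a hypothetical helper that returns the directory it is given when that directory exists and falls back to /usr otherwise.

```shell
#!/bin/sh
# pick_java_home is a hypothetical helper (an assumption for illustration).
# It prints its argument if that directory exists, otherwise /usr.
pick_java_home() {
    if [ -d "$1" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "/usr"
    fi
}

# Try the J2SE 1.5 framework path first, as in the export line above.
JAVA_HOME=$(pick_java_home /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home)
export JAVA_HOME
echo "JAVA_HOME=$JAVA_HOME"
```

Run this in the same shell session before calling startup.sh, so that Tomcat inherits the exported value.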

Check that Tomcat is running by opening http://localhost:8080 in a browser. This should bring up Tomcat's welcome page.

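Finally, edit tomcat-users.xml in Tomcat's conf directory and add a 'manager' role, together with a user that holds this role. The Tomcat Manager asks for this user's login. Restart Tomcat afterwards so that the change is picked up.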
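
As an illustration of what the edited file might look like, the sketch below writes a sample tomcat-users.xml to a scratch file instead of touching the real one. The user name "admin" and the password "changeme" are placeholders (assumptions), as is the USERS_FILE variable; choose your own values.

```shell
# Write a sample tomcat-users.xml to a scratch file for review.
# USERS_FILE, the user name and the password are assumptions, not Tomcat defaults.
USERS_FILE="${USERS_FILE:-${TMPDIR:-/tmp}/tomcat-users.sample.xml}"

cat > "$USERS_FILE" <<'EOF'
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  <role rolename="manager"/>
  <user username="admin" password="changeme" roles="manager"/>
</tomcat-users>
EOF

echo "Wrote $USERS_FILE"
```

After reviewing the result, merge it into /usr/local/apache-tomcat-5.5.12/conf/tomcat-users.xml and restart Tomcat with bin/shutdown.sh followed by bin/startup.sh.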

Downloading and setting up Nutch

Download nutch-0.7.1.tar.gz or some other release and place the file somewhere in your Home directory. Expand the file using Stuffit Expander or the tar command.

Open http://localhost:8080 and click the 'Tomcat Manager' link, logging in as the user with the 'manager' role. Under 'Select WAR file to upload', browse to the Nutch directory and select the file 'nutch-0.7.1.war', which is located in the Nutch root folder, then click 'Deploy'.

Check http://localhost:8080/nutch-0.7.1/en/search.html. You should see the Nutch search form.

Crawling

Note that the nutch command-line tool (in our case nutch-0.7.1/bin/nutch) is not installed with the Tomcat web application ($CATALINA_HOME/webapps/nutch-0.7.1/WEB-INF/...); it stays in the directory where you expanded the download. You can either leave it there or move it manually to $CATALINA_HOME/webapps/nutch-0.7.1/WEB-INF/classes. If you leave it where it is, you will have to configure the classpath or maintain two nutch-site.xml files (one for indexing and one for searching).
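
If you go with two nutch-site.xml files, the search-side copy mainly has to tell the web application where the crawl output lives. The sketch below writes such a file to a scratch location. It assumes the searcher.dir property and the nutch-conf root element as found in conf/nutch-default.xml of the 0.7 releases; check your own release before relying on it. INDEX_DIR and SITE_FILE are hypothetical variables introduced for this example.

```shell
# Sketch of a search-side nutch-site.xml, written to a scratch file.
# INDEX_DIR is an assumption: it should match the directory passed to
# "bin/nutch crawl -dir" (~/nutch_index in the AppleScript on this page).
INDEX_DIR="${INDEX_DIR:-$HOME/nutch_index}"
SITE_FILE="${SITE_FILE:-${TMPDIR:-/tmp}/nutch-site.search.xml}"

cat > "$SITE_FILE" <<EOF
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>$INDEX_DIR</value>
  </property>
</nutch-conf>
EOF

echo "Wrote $SITE_FILE"
```

The reviewed file would then go into the web application as $CATALINA_HOME/webapps/nutch-0.7.1/WEB-INF/classes/nutch-site.xml, followed by a Tomcat restart.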

Using Terminal, cd to the Nutch directory that contains bin/nutch (in our case nutch-0.7.1). From here you can follow the instructions from the Nutch tutorial.
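
For orientation, a first test crawl might look like the sketch below. It creates a seed list with one URL per line and runs the crawl command only if bin/nutch is actually present. NUTCH_HOME, SEEDS, the example URL and the shallow depth of 3 are all assumptions for illustration. The tutorial also covers the URL filter settings, which this sketch does not touch.

```shell
# Create a seed list and start a crawl if bin/nutch is present.
# NUTCH_HOME, SEEDS and the example URL are assumptions; adjust them to your setup.
NUTCH_HOME="${NUTCH_HOME:-$HOME/Desktop/nutch-0.7.1}"
SEEDS="${SEEDS:-${TMPDIR:-/tmp}/link.txt}"

# One URL per line.
printf '%s\n' "http://example.com/" > "$SEEDS"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
    # Run from the Nutch directory so that bin/ and conf/ are found.
    (cd "$NUTCH_HOME" && bin/nutch crawl -dir "$HOME/nutch_index" -depth 3 "$SEEDS")
else
    echo "bin/nutch not found under $NUTCH_HOME; seed list left in $SEEDS"
fi
```

Remember to set JAVA_HOME in the same shell before running the crawl, as was done for Tomcat.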

Like any other Mac application, Terminal is scriptable, which is a nice feature. The AppleScript below starts a crawl when you double-click its icon. It assumes that Nutch was expanded to Desktop/nutch-0.7.1 and that the seed URLs are listed in conf/link.txt.

tell application "Terminal"
	activate -- bring Terminal to the front so the keystroke below reaches it
	-- open a new window if there is none, or if the front window is busy
	if ((count of windows) = 0) or ¬
		(the busy of window 1 = true) then
		tell application "System Events"
			keystroke "n" using command down
		end tell
	end if
	-- send each command to the front window, in order
	do script "cd ~/Desktop/nutch-0.7.1" in window 1
	do script "export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home" in window 1
	do script "bin/nutch crawl -dir ~/nutch_index -depth 20 conf/link.txt" in window 1
end tell