How to set up Nutch on Windows Server 2008 R2 Enterprise (64-bit) and crawl Samba shares.
First, download the following software:
Cygwin
Java 1.6 (or a newer version)
Apache Tomcat 7
Nutch 1.2 (or a newer version)
Install Cygwin (run cygwin.exe) and follow the setup assistant.
Install Java (run jdk-6u24-windows-i586.exe) and set JAVA_HOME under Start -> Computer -> Properties -> Advanced system settings -> Advanced -> Environment Variables...
(Use the 32-bit version of Java; the 64-bit version causes problems on this OS.)
Install Tomcat (run apache-tomcat-7.0.11.exe).
After installation, Tomcat should start its service automatically. If the service is not running, start it manually by opening Configure Tomcat and clicking Start.
Now open http://localhost:8080 in your browser and check that the Tomcat welcome page appears.
To crawl a Samba share, first map it as a network drive (in this example, the share is ipa-data1).
Unzip apache-nutch-1.2-bin.zip to any directory you like; this guide uses C:\.
Now go to the nutch-1.2 directory and create a urls folder.
In this folder, create a text file with any name you like (e.g. files), then edit it and paste in your file URLs, one per line.
Each URL must start with file:///, otherwise it won't work.
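As a minimal sketch, the seed folder and file can also be created from the Cygwin shell. The drive letter Z: is a hypothetical placeholder for whatever letter the share was mapped to, and the commands assume the current directory is nutch-1.2.

```shell
# Run from the nutch-1.2 directory (for example after: cd /cygdrive/c/nutch-1.2).
# Z: is a hypothetical drive letter; replace it with the letter of your mapped share.
mkdir -p urls

# One URL per line; each one must start with file:///
printf '%s\n' 'file:///Z:/' > urls/files

# Show the result for a quick visual check.
cat urls/files
```

Add further lines to urls/files if more than one folder should be crawled.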
Go to the nutch-1.2\conf directory and edit nutch-default.xml.
Here we change the plugin.includes property and set file.content.limit to -1 for unlimited file length.
In plugin.includes, change the value protocol-http to protocol-file (leave the other default values unchanged).
To make sure that Nutch crawls only the links you listed in the urls folder, disable the corresponding property by setting it to false.
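To double-check the edits, the following sketch prints the relevant properties together with their values. The name file.crawl.parent is an assumption for the property that has to be set to false; the source does not name it, so verify it against your own nutch-default.xml. show_props is a hypothetical helper, not part of Nutch.

```shell
# Print selected properties (name line plus the following value line) for review.
# file.crawl.parent is an ASSUMED property name; the guide does not spell it out.
show_props() {
  grep -A 1 -E '<name>(plugin\.includes|file\.content\.limit|file\.crawl\.parent)</name>' "$1"
}

# Only runs when started from the nutch-1.2 directory.
if [ -f conf/nutch-default.xml ]; then
  show_props conf/nutch-default.xml
fi
```

After the changes, the output should show protocol-file inside plugin.includes and -1 for file.content.limit.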
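Go to nutch-1.2\conf\ and edit the file crawl-urlfilter.txt:
Change -^(file|ftp|mailto): to -^(http|ftp|mailto): so that file URLs are no longer skipped.
Comment out the rules under "skip URLs with slash-delimited segment" and "accept hosts in MY.DOMAIN.NAME".
Under "skip everything else", change the rule -. to +. so that everything else is accepted.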
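A quick way to review the result is to list only the active rules of the filter file. This is a sketch; active_rules is a hypothetical helper name, and the path assumes the current directory is nutch-1.2.

```shell
# List the active rules of a URL filter file, i.e. every line that is
# neither blank nor a comment starting with '#'.
active_rules() {
  grep -vE '^[[:space:]]*(#|$)' "$1"
}

# Only runs when started from the nutch-1.2 directory.
if [ -f conf/crawl-urlfilter.txt ]; then
  active_rules conf/crawl-urlfilter.txt
fi
```

After the edits, the list should contain -^(http|ftp|mailto): and end with +., and it should no longer contain a MY.DOMAIN.NAME rule.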
Edit nutch-1.2\conf\nutch-site.xml and paste in some default properties:
<value> firstname.lastname@example.org </value>
<description> email@example.com </description>
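The property list above survives only in fragments, so the following is a sketch of what a minimal nutch-site.xml might look like rather than the original file. The http.agent.* property names come from nutch-default.xml; which of them the author set, and all values shown, are assumptions. CONF_DIR is a hypothetical variable that defaults to conf below the current directory.

```shell
# Write a minimal nutch-site.xml. All values are placeholders (assumptions).
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

# Keep a backup if a nutch-site.xml already exists.
if [ -f "$CONF_DIR/nutch-site.xml" ]; then
  cp "$CONF_DIR/nutch-site.xml" "$CONF_DIR/nutch-site.xml.bak"
fi

cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Intranet file crawler</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>firstname.lastname@example.org</value>
  </property>
</configuration>
EOF
```

Run it from the nutch-1.2 directory, or set CONF_DIR to the conf folder of your installation.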
Open cygwin.exe, navigate to the nutch-1.2 directory with cd /cygdrive/c/nutch-1.2, and run the crawl with the bin/nutch crawl command.
Options you can use:
• -dir dir names the directory to put the crawl in
• -threads threads determines the number of threads that will fetch in parallel
• -depth depth indicates the link depth from the root page that should be crawled
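Putting the options together, a crawl invocation might look like the sketch below. The depth and thread values are illustrative assumptions, NUTCH_HOME is a hypothetical variable, and crawl as the output directory is chosen to match the C:\nutch-1.2\crawl\ folder used later for searcher.dir.

```shell
# Assumed values: adjust NUTCH_HOME, depth and thread count to your setup.
NUTCH_HOME="${NUTCH_HOME:-/cygdrive/c/nutch-1.2}"
CRAWL_ARGS="crawl urls -dir crawl -depth 3 -threads 10"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  # CRAWL_ARGS is left unquoted on purpose so it splits into separate arguments.
  (cd "$NUTCH_HOME" && bin/nutch $CRAWL_ARGS)
else
  echo "bin/nutch not found under $NUTCH_HOME; would run: bin/nutch $CRAWL_ARGS"
fi
```

The urls argument is the seed folder created earlier; the crawl data ends up in nutch-1.2/crawl.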
To use the Tomcat manager, edit tomcat-users.xml in C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\conf\:
Add a new user and a new role.
Save the settings and restart Tomcat (see the Tomcat installation step above for how to start the service).
Go to http://localhost:8080/manager/html in the browser and log in with the user you just added to tomcat-users.xml.
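As a sketch, the following generates a tomcat-users.xml in the current directory that you can then copy into Tomcat's conf\ folder (writing below Program Files usually requires administrator rights). manager-gui is the Tomcat 7 role for the HTML manager; the user name and password are hypothetical placeholders, and the original entry from the guide is not preserved.

```shell
# Generate a tomcat-users.xml in the current directory.
# nutchadmin / change-me are placeholders; pick your own credentials.
cat > tomcat-users.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<tomcat-users>
  <role rolename="manager-gui"/>
  <user username="nutchadmin" password="change-me" roles="manager-gui"/>
</tomcat-users>
EOF
```

If your existing tomcat-users.xml already contains entries, copy only the role and user lines into its tomcat-users element instead of replacing the file.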
In the WAR file to deploy section, select the \nutch-1.2\nutch-1.2.war file and upload it.
The application /nutch-1.2 then appears in the list; start it.
Go to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\ and you will see that there is a folder called nutch-1.2.
Navigate to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\nutch-1.2\WEB-INF\classes\ and edit nutch-site.xml so that searcher.dir points to your crawl folder (for example C:\nutch-1.2\crawl\):
<property>
  <name>searcher.dir</name>
  <value>C:\nutch-1.2\crawl\</value>
</property>
After that, restart Tomcat.
Go to http://localhost:8080/nutch-1.2 and you should see the Nutch search page.
Now you can search for your files!
Don't forget to map the network drives on every client system as well; otherwise files cannot be opened and edited directly from the Nutch search results.