How to set up Nutch on Windows Server 2008 R2 Enterprise (64-bit) and crawl Samba shares.

First of all you need to download the following software:

Java 1.6 (or newer version):

Tomcat 7:


Nutch-1.2 (or newer version):

Step 1:

Install Cygwin (run cygwin.exe) and follow the setup assistant.

Step 2:

Install Java (run jdk-6u24-windows-i586.exe) and set JAVA_HOME in Start -> Computer -> Properties -> Advanced system settings -> Advanced -> Environment Variables...

(Use the 32-bit version of Java; the 64-bit version has some problems on this OS.)
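
Since Cygwin was installed in Step 1, a quick check from a newly opened Cygwin window can confirm that the variable is visible. This is only a sketch: JAVA_STATUS is a variable name made up for this example, and it assumes Cygwin can resolve the Windows-style path stored in JAVA_HOME.

```shell
# Report whether JAVA_HOME is set and points at a Java installation.
if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
  JAVA_STATUS="JAVA_HOME OK: $JAVA_HOME"
  # Print the version of the Java installation that JAVA_HOME points at.
  "$JAVA_HOME/bin/java" -version
else
  JAVA_STATUS="JAVA_HOME is not set or does not point at a Java installation"
fi
echo "$JAVA_STATUS"
```

If the second message appears, re-check the environment variable and open a new Cygwin window before continuing.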

Step 3:

Install Tomcat (run apache-tomcat-7.0.11.exe).

After installation, Tomcat should start the service automatically. If the service is not running, start it manually by clicking on Configure Tomcat and then Start:

Now go to http://localhost:8080 in your browser and check that the Tomcat welcome page appears.

Step 4:

To crawl a Samba share, you first have to set up (map) the share as a network drive:

(In this example it's ipa-data1)

Step 5:

Unzip the Nutch archive to any directory you like; I prefer C:\:

Now go to the nutch-1.2 directory and create a folder named urls.

In this folder, create a text file with any name you like (e.g. files). Now edit it and paste in your file URLs:

Each URL must start with file:/// (three slashes); otherwise it won't work.
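
As a sketch, the folder and seed file can also be created from the Cygwin shell. The drive letter Z: is a hypothetical placeholder for whatever letter the share was mapped to in Step 4; the file name files follows the example above.

```shell
# Run from the nutch-1.2 directory in the Cygwin shell.
# "Z:" is a hypothetical drive letter; use the one your share is mapped to.
mkdir -p urls
printf '%s\n' 'file:///Z:/' > urls/files

# Show the seed list that Nutch will read.
cat urls/files
```

Put one URL per line if you want to crawl several folders.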

Step 6:

Go to the nutch-1.2\conf directory and edit the nutch-default.xml:

Here we have to change the property plugin.includes and set the file content limit (file.content.limit) to -1 for unlimited file length. Take a look at the changes:

In plugin.includes, change the value protocol-http to protocol-file (don't change the other default values):

To make Nutch crawl only the links you listed in the urls folder, disable this property by setting it to false:
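
For orientation, the edited entries in nutch-default.xml might look roughly like the sketch below. Two things in it are assumptions rather than part of the original text: the third property name, file.crawl.parent, is a guess at which property is meant (it controls whether the file protocol also climbs into parent directories), and the "..." in plugin.includes stands for the rest of the shipped value, which should stay exactly as it is.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Only the first entry changes: protocol-http becomes protocol-file.
       "..." is a placeholder for the remaining default entries; keep them. -->
  <value>protocol-file|urlfilter-regex|...</value>
</property>

<property>
  <name>file.content.limit</name>
  <!-- -1 disables truncation, so files of any length are fetched whole. -->
  <value>-1</value>
</property>

<!-- Assumption: this is the property the step refers to. -->
<property>
  <name>file.crawl.parent</name>
  <value>false</value>
</property>
```

The description elements of each property are omitted here for brevity; leave them in place in the real file.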

Step 7:

Go to nutch-1.2\conf\ and edit the file crawl-urlfilter.txt:

Change -^(file|ftp|mailto): to -^(http|ftp|mailto):

Disable (comment out) the rules under "skip URLs with slash-delimited segment" and "accept hosts in MY.DOMAIN.NAME".

Change the rule under "skip everything else" from -. to +. so that everything else is accepted.
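
Assuming the stock crawl-urlfilter.txt shipped with Nutch 1.2, the affected rules would then read roughly as follows. Rules not mentioned above (such as the suffix filter) are left out of this sketch and stay unchanged.

```text
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.
```

A leading "-" rejects matching URLs, a leading "+" accepts them, and a leading "#" turns the line into a comment.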

Step 8:

Edit the file nutch-1.2\conf\nutch-site.xml and paste in some default properties:
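
The exact properties used here are not preserved in this text. As a hedged sketch, Nutch 1.x setups commonly fill in the agent-identification properties, because the fetcher checks that http.agent.name is set before it starts. Every value below is a placeholder invented for this example.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- All values are placeholders; replace them with your own. -->
  <property>
    <name>http.agent.name</name>
    <value>MyFileCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyFileCrawler,*</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Intranet file share crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://localhost/</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>admin@example.com</value>
  </property>
</configuration>
```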

Step 9:

Open Cygwin (cygwin.exe) and run the crawl with this command:

(First, navigate to the nutch-1.2 directory with cd /cygdrive/c/nutch-1.2)

Options you can use:

• -dir dir names the directory to put the crawl in

• -threads threads determines the number of threads that will fetch in parallel

• -depth depth indicates the link depth from the root page that should be crawled
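
Putting the options above together, a crawl invocation might look like the sketch below. The values for the crawl directory, thread count, and depth are hypothetical, as are the variable names; only bin/nutch crawl, the urls folder from Step 5, and the three options come from this guide.

```shell
# Hypothetical values; adjust them to your setup.
NUTCH_DIR="/cygdrive/c/nutch-1.2"   # where Nutch was unzipped in Step 5
CRAWL_DIR=crawl                     # -dir: directory to put the crawl in
THREADS=10                          # -threads: number of parallel fetch threads
DEPTH=3                             # -depth: link depth from the seed URLs

CRAWL_CMD="bin/nutch crawl urls -dir $CRAWL_DIR -threads $THREADS -depth $DEPTH"
echo "$CRAWL_CMD"

# Run the crawl only if the Nutch launcher script is actually present.
if [ -x "$NUTCH_DIR/bin/nutch" ]; then
  (cd "$NUTCH_DIR" && $CRAWL_CMD)
fi
```

With these values the crawl data ends up in C:\nutch-1.2\crawl.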


Step 10:

To use the Tomcat manager, you have to edit the tomcat-users.xml in C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\conf\:

Add a new user and new role, like this:
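
A minimal sketch of such an entry follows. The user name and password are placeholders made up for this example; manager-gui is the Tomcat 7 role that grants access to the HTML manager page used in the next step.

```xml
<tomcat-users>
  <!-- Role required for the HTML manager at /manager/html in Tomcat 7. -->
  <role rolename="manager-gui"/>
  <!-- Placeholder credentials; choose your own. -->
  <user username="admin" password="change-me" roles="manager-gui"/>
</tomcat-users>
```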

Save the settings and restart Tomcat (see Step 3).

Step 11:

Go to http://localhost:8080/manager/html in the browser (log in with the user from Step 10).

In the WAR file to deploy section, select the \nutch-1.2\nutch-1.2.war file to upload:

You will then see /nutch-1.2 in the list of applications; start it.

Go to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\ and you will see that there is a folder called nutch-1.2.

Step 12:

Navigate to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\nutch-1.2\WEB-INF\classes\ and edit the nutch-site.xml:
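
The edit itself is not preserved in this text. A likely candidate, given here as an assumption, is the searcher.dir property, which tells the Nutch web application where to find the crawl data. The path below is hypothetical and assumes the crawl directory was named crawl and created under C:\nutch-1.2.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- Hypothetical path: the directory passed to -dir in Step 9. -->
    <value>C:\nutch-1.2\crawl</value>
  </property>
</configuration>
```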



After that, restart Tomcat.

Step 13:

Go to http://localhost:8080/nutch-1.2 and you should see the Nutch search page.

Now you can search for your files!

Don't forget that you have to set up the network drives on every system from which files should be opened or edited directly through Nutch!