How to set up Nutch on Windows Server 2008 R2 Enterprise (64-bit) and crawl Samba shares.
First, download the following software:
Java 1.6 (or newer version):
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Tomcat 7:
http://tomcat.apache.org/download-70.cgi
Cygwin:
Nutch-1.2 (or newer version):
ftp://mirror.switch.ch/mirror/apache/dist//nutch/ (apache-nutch-1.2-bin.zip)
Step 1:
Install Cygwin (run cygwin.exe) and follow the setup assistant.
Step 2:
Install Java (run jdk-6u24-windows-i586.exe) and set JAVA_HOME under Start -> Computer -> Properties -> Advanced system settings -> Advanced -> Environment Variables...
(Use the 32-bit version of Java; the 64-bit version causes problems on this OS!)
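Nutch's bin/nutch script refuses to start without JAVA_HOME, so it can be worth checking from a Cygwin shell that the variable is visible. A minimal sketch, assuming a Cygwin bash shell opened after the variable was set (java_status is only an illustrative variable name):

```shell
# Check whether JAVA_HOME is visible in this shell.
# java_status is an illustrative variable name, not part of Nutch or Cygwin.
if [ -z "${JAVA_HOME:-}" ]; then
  java_status="unset"
  echo "JAVA_HOME is not set; set it in Environment Variables and reopen Cygwin"
else
  java_status="set"
  echo "JAVA_HOME is set to: $JAVA_HOME"
fi
```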
Step 3:
Install Tomcat (run apache-tomcat-7.0.11.exe).
After installation, Tomcat should start its service automatically. If the service is not running, start it manually by clicking Configure Tomcat and then Start.
Now go to http://localhost:8080 in your browser and check that the Tomcat welcome page appears.
Step 4:
To crawl a Samba share, you first have to set up the network drive:
(In this example it's ipa-data1.)
Step 5:
Unzip apache-nutch-1.2-bin.zip to any directory you like; I prefer C:\.
Now go to the nutch-1.2 directory and create a urls folder.
In this folder, create a text file with any name you like (e.g. files). Edit it and paste in your file URLs:
Each URL has to start with file:///, otherwise it won't work.
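As a rough sketch, the urls folder and the seed file can also be created from the Cygwin shell. This assumes you are in the nutch-1.2 directory; the drive letter Z: and the folder name shared-docs are made-up placeholders for your own mapped network drive.

```shell
# Assumption: the current directory is nutch-1.2.
# Z: and shared-docs are hypothetical; use your own mapped drive and folder.
seed_url="file:///Z:/shared-docs/"

mkdir -p urls                           # folder that holds the seed list
printf '%s\n' "$seed_url" > urls/files  # one URL per line
cat urls/files                          # show what was written
```

Add further shares by appending more file:/// lines to the same file.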
Step 6:
Go to the nutch-1.2\conf directory and edit nutch-default.xml:
Here we have to change the plugin.includes property and set the limit for file content (file.content.limit) to -1 for unlimited file length. Take a look at the changes:
In plugin.includes, change the value protocol-http to protocol-file (don't change the other default values):
To make Nutch crawl only the links you specified in the urls folder, disable this property by setting it to false:
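After editing, the two named properties can be checked from the Cygwin shell. This is only a sketch: it assumes you are in the nutch-1.2 directory and that each value sits on the line right after its name, as in the stock file.

```shell
# Assumption: the current directory is nutch-1.2 and conf/nutch-default.xml
# has already been edited as described above.
conf_file="conf/nutch-default.xml"
checked_props="plugin.includes file.content.limit"

for prop in $checked_props; do
  if [ -f "$conf_file" ]; then
    # Show the <name> line plus the two lines after it, where <value> normally sits.
    grep -A 2 "<name>$prop</name>" "$conf_file"
  else
    echo "$conf_file not found; run this from the nutch-1.2 directory"
  fi
done
```

The plugin.includes value should now contain protocol-file instead of protocol-http, and file.content.limit should show -1.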
Step 7:
Go to nutch-1.2\conf\ and edit the file crawl-urlfilter.txt:
Change -^(file|ftp|mailto): to -^(http|ftp|mailto):
Disable the rules under "skip URLs with slash-delimited segment" and "accept hosts in MY.DOMAIN.NAME" by commenting them out with #.
Under "skip everything else", change -. to +. so that everything else is accepted.
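Put together, the edited rules might look like the sketch below. The rule text is assumed from the stock Nutch 1.2 crawl-urlfilter.txt, so compare it with your own copy rather than pasting it blindly. The snippet writes the rules to a scratch file (crawl-urlfilter.sketch.txt, a hypothetical name that Nutch does not read) so you can diff it against conf/crawl-urlfilter.txt.

```shell
# Sketch of the edited rules, assumed from the stock Nutch 1.2 file.
# crawl-urlfilter.sketch.txt is a hypothetical scratch file, not read by Nutch.
sketch_file="crawl-urlfilter.sketch.txt"

cat > "$sketch_file" <<'EOF'
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.
EOF

cat "$sketch_file"
```

The other rules in the file (for example, the suffix filter) stay as they are.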
Step 8:
Edit the file nutch-1.2\conf\nutch-site.xml and paste in some default properties:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>test</value>
    <description>test</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch</value>
    <description>Nutch</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://test.url</value>
    <description>http://test.url</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>test@test.ch</value>
    <description>test@test.ch</description>
  </property>
</configuration>
Step 9:
Open cygwin.exe and run the crawl with this command:
(First, navigate to the nutch-1.2 directory with cd /cygdrive/c/nutch-1.2.)
Options you can use:
• -dir dir names the directory to put the crawl in
• -threads threads determines the number of threads that will fetch in parallel
• -depth depth indicates the link depth from the root page that should be crawled
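A crawl invocation using these options might look like the sketch below. It assumes Nutch was unzipped to C:\nutch-1.2 and that the seed file lives in the urls folder from Step 5. The output directory name (crawl), the thread count and the depth are illustrative values, not recommendations.

```shell
# Assumptions: Nutch lives in C:\nutch-1.2 and the seed list is in ./urls.
# The -dir, -threads and -depth values are examples only.
nutch_home="/cygdrive/c/nutch-1.2"
crawl_cmd="bin/nutch crawl urls -dir crawl -threads 10 -depth 3"

if [ -f "$nutch_home/bin/nutch" ]; then
  # $crawl_cmd is left unquoted on purpose so the shell splits it into arguments.
  (cd "$nutch_home" && $crawl_cmd)
else
  echo "Nutch not found in $nutch_home; the command would be: $crawl_cmd"
fi
```

With these values, the crawl data ends up in C:\nutch-1.2\crawl, which is the folder that searcher.dir points to in Step 12.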
Step 10:
To use the Tomcat manager, you have to edit tomcat-users.xml in C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\conf\:
Add a new user and a new role, like this:
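As a sketch, the snippet below prints the two elements that would go inside the <tomcat-users> element. The user name and password are made-up placeholders, and manager-gui is assumed to be the role your Tomcat 7 build expects for the HTML manager page; check the comments in your own tomcat-users.xml if login fails.

```shell
# Hypothetical credentials for illustration; choose your own.
manager_user="nutchadmin"
manager_pass="change-me"

# Print the two elements to paste inside <tomcat-users> ... </tomcat-users>.
# manager-gui is assumed to be the role for the HTML manager at /manager/html.
print_manager_entries() {
  cat <<EOF
<role rolename="manager-gui"/>
<user username="$manager_user" password="$manager_pass" roles="manager-gui"/>
EOF
}

print_manager_entries
```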
Save the settings and restart Tomcat (see Step 3).
Step 11:
Go to http://localhost:8080/manager/html in the browser (log in with the user from Step 10).
In the WAR file to deploy section, select the \nutch-1.2\nutch-1.2.war file to upload:
You will then see /nutch-1.2 in the list; start it.
Go to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\ and you will see that there is a folder called nutch-1.2.
Step 12:
Navigate to C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0\webapps\nutch-1.2\WEB-INF\classes\ and edit the nutch-site.xml:
<configuration>
<property>
<name>searcher.dir</name>
<value>your_crawl_folder (like C:\nutch-1.2\crawl\)</value>
</property>
</configuration>
After that, restart Tomcat.
Step 13:
Go to http://localhost:8080/nutch-1.2 and you should see the Nutch search page.
Now you can search for your files!
Don't forget that you have to set up the network drives on every system, so that files can be opened and edited directly from the Nutch results!
Enjoy!