RunNutchInEclipse
This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page
Tested with
Nutch release 0.8
Eclipse 3.2
Java 1.4 and 1.5
Ubuntu (should work on most platforms, though)
Before you start
Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again, you might be quicker by looking at the logs (logs/hadoop.log)...
Steps
Install Nutch
Grab a fresh release of Nutch 0.8 or make a fresh checkout of Nutch 0.8 from svn
Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory
Create a new Java project in Eclipse
File > New > Project > Java project > click Next
select "Create project from existing source" and use the location where you downloaded Nutch
click on Next, and wait while Eclipse is scanning the folders
add the folder "conf" to the classpath (scroll down the list and right-click on "conf". This step is necessary)
Eclipse should have guessed all the Java files that must be added to your classpath. If that is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
set output dir to "tmp_build", create it if necessary
DO NOT add "build" to classpath
alternatively, you can use a ready-made .classpath file instead of configuring the folders by hand
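The source folders, libraries and output directory described in the steps above could be captured in a .classpath file roughly like the one below. This is only a minimal sketch: the plugin name (parse-html) and the jar name (lib/example.jar) are placeholders for illustration, adding "conf" as a source folder is an assumption about how Eclipse records that step, and a real file needs one entry per plugin source folder and per jar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
    <!-- core sources and tests -->
    <classpathentry kind="src" path="src/java"/>
    <classpathentry kind="src" path="src/test"/>
    <!-- "conf" must be on the classpath so the configuration files are found -->
    <classpathentry kind="src" path="conf"/>
    <!-- one pair of entries per plugin; parse-html is only an example -->
    <classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
    <classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
    <!-- one entry per jar in "lib" and in the plugin lib folders; placeholder name -->
    <classpathentry kind="lib" path="lib/example.jar"/>
    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
    <!-- output goes to tmp_build, not to "build" -->
    <classpathentry kind="output" path="tmp_build"/>
</classpath>
```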
If you're using the trunk
As of revision 511012 there were a few plugins on the trunk and a couple other files that did not build, and are actually excluded from the ant projects. You may want to remove the following projects from the build structure:
plugin/parse-mp3
plugin/parse-rtf
contrib/*
Configure Nutch
see the Tutorial
change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-site.xml
make sure Nutch is configured correctly before testing it in Eclipse
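As an illustration, the plugin.folders override in conf/nutch-site.xml might look like the fragment below. It is a sketch that assumes the file contains no other overrides yet; any properties you already set following the Tutorial stay inside the same configuration element.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- point Nutch at the plugin sources instead of a built plugin directory -->
  <property>
    <name>plugin.folders</name>
    <value>./src/plugin</value>
  </property>
</configuration>
```

The relative path is resolved against the working directory of the launched program, which for an Eclipse launcher is normally the project root.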
Build Nutch
If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build".
Create Eclipse launcher
Menu Run > "Run..."
create "New" for "Java Application"
set in Main class
org.apache.nutch.crawl.Crawl
on tab Arguments, Program Arguments
urls -dir crawl -depth 3 -topN 50
in VM arguments
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
click on "Run"
if all works, you should see Nutch getting busy at crawling
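If you save the launcher as a shared file (Common tab of the launch dialog), Eclipse writes a .launch file that should look roughly like the sketch below. The project name "nutch" is an assumption, and the exact set of attributes Eclipse writes can differ between versions, so treat this as a way to check your settings rather than a file to copy verbatim.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<launchConfiguration type="org.eclipse.jdt.launching.localJavaApplication">
    <!-- assumed project name; use the name you gave the project in Eclipse -->
    <stringAttribute key="org.eclipse.jdt.launching.PROJECT_ATTR" value="nutch"/>
    <stringAttribute key="org.eclipse.jdt.launching.MAIN_TYPE" value="org.apache.nutch.crawl.Crawl"/>
    <!-- program arguments from the Arguments tab -->
    <stringAttribute key="org.eclipse.jdt.launching.PROGRAM_ARGUMENTS" value="urls -dir crawl -depth 3 -topN 50"/>
    <!-- VM arguments from the Arguments tab -->
    <stringAttribute key="org.eclipse.jdt.launching.VM_ARGUMENTS" value="-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log"/>
</launchConfiguration>
```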
Debug Nutch in Eclipse
Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
If things do not work...
Yes, Nutch and Eclipse can be a difficult companionship sometimes
Eclipse: Cannot create project content in workspace
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from the existing source, Eclipse would not let me do it with the source code inside the workspace. With the source code outside my workspace, it worked fine.
plugin dir not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, probably better, in nutch-site.xml:
<property>
  <name>plugin.folders</name>
  <value>/home/....../nutch-0.8/src/plugin</value>
</property>
No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
Unit tests work in Eclipse but fail when running ant on the command line
Suppose your unit tests work perfectly in Eclipse, but every one of them fails when you run ant test on the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. If so, try removing it from that file and adding it directly to nutch-site.xml
Run ant test again. That should have solved the problem.
If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target?
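For reference, registering a plugin on the test target of the plugin build.xml might look like the fragment below. This is a sketch under assumptions: "my-plugin" is a hypothetical plugin directory name, and the surrounding structure (for example the parallel element and its thread count) should be copied from your own checkout rather than from here.

```xml
<target name="test">
  <parallel threadCount="2">
    <!-- entries for the existing plugins stay as they are -->
    <!-- hypothetical entry for the plugin you are testing -->
    <ant dir="my-plugin" target="test"/>
  </parallel>
</target>
```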
classNotFound
open the class itself, right-click
refresh the build dir
missing org.farng and com.etranslate
You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because their licenses are incompatible with the Apache license, the required jars were left out of the sources. You can find them here:
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
You need to copy the jar files into the "lib" folder of the corresponding plugin and refresh the project.
debugging hadoop classes
Sometimes it makes sense to also have the Hadoop classes available during debugging. To do so, you can check out the Hadoop sources on your machine and attach them to the hadoop-xxx.jar. Alternatively, you can:
Remove the hadoopXXX.jar from your classpath libraries
Check out the Hadoop branch that is used within Nutch
configure a Hadoop project in Eclipse, similar to the Nutch project
add the Hadoop project as a dependent project of the Nutch project
you can now also set breakpoints within Hadoop classes, like InputFormat implementations etc.
Original credits: RenaudRichardet