RunNutchInEclipse

This page acts as a resource for working with Nutch from within the Eclipse IDE. It is intended to provide a comprehensive beginning resource for the configuration, building, crawling and debugging of Nutch trunk in the above context.

Tested with

This tutorial also works for Nutch 1.6 and the 2.x series, given a couple of changes and some dependency fixes. See the bottom section of this page for suggested fixes.

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line. However, being able to debug Nutch in Eclipse is very useful, as is applying and testing patches there, since it lets you see them working in a larger context. That said, you will still benefit greatly from looking at the hadoop.log output.

This tutorial covers a fully internal Eclipse/Nutch set up, using only Eclipse tools and associated plugins.

Prerequisites

Steps

Install Nutch

Use the Subclipse plugin to check out the latest Nutch trunk.

For version 2.1, use https://svn.apache.org/repos/asf/nutch/branches/2.1/ instead. Trunk is currently version 1.6.

Establish the Eclipse environment for Nutch

Configure Nutch

Build Nutch

BUILD SUCCESSFUL
Total time: 33 seconds

At this stage it is advisable to right-click the project in the Package Explorer and choose Refresh. This reveals the new runtime directory. Because the configuration settings were made earlier, all that remains is to add the seed directory to the /runtime/local directory; then we are ready to crawl.
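
Adding the seed directory can also be done from a terminal. The following is only a minimal sketch: it assumes it is run from the project root, that the seed directory is called urls (the name passed to the crawl launcher), and it uses a placeholder file name (seed.txt) and a placeholder URL, neither of which comes from this page.

```shell
# Sketch: create the seed directory inside runtime/local.
# Assumptions: run from the project root; "seed.txt" and the URL are placeholders.
RUNTIME_LOCAL="${RUNTIME_LOCAL:-runtime/local}"
SEED_DIR="$RUNTIME_LOCAL/urls"

mkdir -p "$SEED_DIR"

# One URL per line; add further lines for more seeds.
printf '%s\n' "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"

echo "Seed list written to $SEED_DIR/seed.txt"
```

Refresh the project in Eclipse again afterwards so that the new directory shows up in the Package Explorer.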

Create Eclipse launcher

Main class: org.apache.nutch.crawl.Crawl

Program arguments: urls -dir crawl -depth 3 -topN 50

VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

Debug Nutch in Eclipse

Fetcher [line: 1115] - run
Fetcher [line: 530] - fetch
Fetcher$FetcherThread [line: 560] - run()
Generator [line: 443] - generate
Generator$Selector [line: 108] - map
OutlinkExtractor [line: 71 & 74] - getOutlinks

Remote Debugging in Eclipse

  1. Create a new Debug Configuration of type Remote Java Application and note the port (here: 37649).

  2. Launch Nutch from the command line, adding options to load the Java debugger's JDWP agent library, e.g. from bash:

% export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:37649"
% $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

  3. The application will be suspended just after launch.

  4. Now go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.

With this approach there is no need to create an extra launch configuration for every tool you want to debug: a single configuration is enough to debug any tool (parsechecker, indexchecker, URL filter, etc.). It also works remotely, with the crawler or tool running on a server and the Eclipse debugger running locally.
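
If the debug port changes often, the option string can be built by a small shell function. This is a sketch only: jdwp_opts is a hypothetical helper name, not part of Nutch, and it simply reproduces the option string used above with the port made configurable.

```shell
# Hypothetical helper: build the JDWP agent option string for a given port.
# Defaults to 37649, the port used in the example on this page.
jdwp_opts() {
  port="${1:-37649}"
  printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:${port}"
}

NUTCH_OPTS="$(jdwp_opts 37649)"
export NUTCH_OPTS
echo "$NUTCH_OPTS"

# Then start any tool as usual, for example:
#   "$NUTCH_HOME/bin/nutch" parsechecker http://myurl.com/
```

The port passed to the function must match the port entered in the Remote Java Application configuration.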

Debugging and Timeouts

Debugging takes time, especially when inspecting variables, stack traces, etc. It often takes long enough that a timeout fires and stops the application. In the nutch-site.xml used for debugging, set timeouts to a rather high value (or -1 for unlimited), e.g., when debugging the parser:

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>
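
To find out which other timeouts might need raising, it can help to list every property whose name contains "timeout". The sketch below assumes the default configuration lives at conf/nutch-default.xml relative to the current directory; list_timeout_properties is a hypothetical helper name. Whether a given property accepts -1 should be checked in its description before changing it.

```shell
# Hypothetical helper: print the names of all properties in a Nutch
# configuration file whose name contains "timeout".
list_timeout_properties() {
  grep -o '<name>[^<]*timeout[^<]*</name>' "$1" \
    | sed -e 's/<name>//' -e 's/<\/name>//'
}

# Assumption: path of the default configuration; adjust to your checkout.
CONF_FILE="${CONF_FILE:-conf/nutch-default.xml}"
if [ -f "$CONF_FILE" ]; then
  list_timeout_properties "$CONF_FILE"
else
  echo "Config file not found: $CONF_FILE" >&2
fi
```

Override the properties you need in nutch-site.xml rather than editing nutch-default.xml.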

If things do not work...

Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)

Missing dependencies

I have found the following dependencies missing:

  1. jsch
  2. nekohtml
  3. com.sun.syndication
  4. tagsoup

Adding the above jar files to the build path using the 'external jars' option has resolved the errors.
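
Before adding the jars, they have to be located on disk. The following sketch searches a directory tree for them. It rests on two assumptions that are not from this page: that the jars were already downloaded into Ivy's default cache (~/.ivy2/cache), and that com.sun.syndication is provided by the ROME jar (file name starting with rome). find_jar is a hypothetical helper name.

```shell
# Hypothetical helper: list jar files below a directory whose file name
# contains the given library name.
find_jar() {
  find "$1" -type f -name "*$2*.jar" 2>/dev/null
}

# Assumptions: Ivy's default cache location; "rome" as the jar that
# provides the com.sun.syndication package.
SEARCH_DIR="${SEARCH_DIR:-$HOME/.ivy2/cache}"
if [ -d "$SEARCH_DIR" ]; then
  for lib in jsch nekohtml rome tagsoup; do
    echo "== $lib"
    find_jar "$SEARCH_DIR" "$lib"
  done
else
  echo "Search directory not found: $SEARCH_DIR" >&2
fi
```

Set SEARCH_DIR to another location (for example the project's build directory) if the jars are stored elsewhere.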

eclipse: Cannot create project content in workspace

The Nutch source code must be outside the workspace folder. Alternatively, check out the code with Eclipse (svn) directly into your workspace rather than creating the project from existing code; Eclipse sometimes refuses to create a project from source code that already lies inside the workspace.

plugin directory not found

Make sure the plugin.folders property is set correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, even better, in nutch-site.xml. Ideally, every effort should be made to keep nutch-default.xml completely intact.

<property>
  <name>plugin.folders</name>
  <value>/home/....../trunk/src/plugin</value>
</property>

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
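
A quick way to check whether a plugin.folders value points at a usable directory is to count the plugin descriptors (plugin.xml files) one level below it. This is a sketch under assumptions: count_plugins is a hypothetical helper name, and the default path src/plugin is taken relative to the project root.

```shell
# Hypothetical helper: count plugin.xml files located directly inside the
# sub-directories of a plugin folder (one descriptor per plugin).
count_plugins() {
  find "$1" -mindepth 2 -maxdepth 2 -type f -name plugin.xml 2>/dev/null \
    | wc -l | tr -d ' '
}

# Assumption: plugin folder relative to the project root; use the same
# value you configured for plugin.folders.
PLUGIN_DIR="${PLUGIN_DIR:-src/plugin}"
if [ -d "$PLUGIN_DIR" ]; then
  echo "$(count_plugins "$PLUGIN_DIR") plugin descriptors under $PLUGIN_DIR"
else
  echo "Plugin directory not found: $PLUGIN_DIR" >&2
fi
```

A count of zero, or a missing directory, suggests that the configured path is wrong for the working directory Eclipse uses.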

classNotFound

debugging Hadoop classes

Sometimes (fairly often) it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily on the underlying Hadoop infrastructure. You can therefore check out (svn) the Hadoop sources into your Eclipse IDE and debug the two projects together.

Non-ported Plugins to 2.x

A few plugins have not yet been ported to the Nutch 2.x series. If you are following the above tutorial to build Nutch 2.x, please check Nutch2Plugins for more information.

Other Resources

http://florianhartl.com/nutch-installation.html

http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

RunNutchInEclipse (last edited 2013-03-21 21:35:24 by kiranchitturi)