Running Nutch in Eclipse

This page gives instructions for setting up a development environment for Nutch under the Eclipse IDE. It is intended as a comprehensive starting resource for configuring, building, crawling with and debugging Nutch trunk in Eclipse.

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line. However, being able to debug Nutch in Eclipse is very useful, particularly when applying and testing patches, because you can see them working in a larger context. That said, you will still benefit greatly from looking at the hadoop.log output. This tutorial covers a fully internal Eclipse/Nutch setup, using only Eclipse tools and associated plugins.

Prerequisites

Steps

Checkout and Build Nutch

  1. Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk) run this:
     svn co https://svn.apache.org/repos/asf/nutch/trunk
     cd trunk
    For Nutch 2.x run this:
     svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
     cd 2.x
    For Nutch 1.x (i.e. trunk), skip ahead to step 5.
  2. At this point you should have decided which data store you want to use. See the Apache Gora documentation for more information. Here are a few of the available storage classes:

      org.apache.gora.hbase.store.HBaseStore
      org.apache.gora.cassandra.store.CassandraStore
      org.apache.gora.accumulo.store.AccumuloStore
      org.apache.gora.avro.store.AvroStore
      org.apache.gora.avro.store.DataFileAvroStore
    Add the storage class name to “conf/nutch-site.xml”. For example, if you pick HBase as the data store, add this to “conf/nutch-site.xml”:
     <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
      <description>Default class for storing data</description>
     </property>
  3. In ivy/ivy.xml, uncomment the dependency for the data store that you selected. For example, if you plan to use HBase, uncomment this line:
      <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
  4. Set the default data store in conf/gora.properties. For example, for HBase as the data store, put this in conf/gora.properties:
     gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  5. Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins. For example, if Nutch is located at "/home/tejas/Desktop/2.x", set the property to:
     <property>
       <name>plugin.folders</name>
       <value>/home/tejas/Desktop/2.x/build/plugins</value>
     </property>
  6. Run this command:
      ant eclipse
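
The properties from step 5 could be collected as in the following sketch. It writes a sample file rather than touching conf/nutch-site.xml, so the `<property>` elements can be merged by hand. The agent name "MyNutchSpider", the file name nutch-site.sample.xml and the NUTCH_CHECKOUT variable are placeholders invented for this illustration. Listing the agent name first and "*" last in http.robots.agents follows the description in conf/nutch-default.xml as I understand it, so check that file for your version.

```shell
# Sketch only: "MyNutchSpider" and NUTCH_CHECKOUT are hypothetical placeholders.
# NUTCH_CHECKOUT defaults to the current directory (the root of the checkout).
NUTCH_CHECKOUT="${NUTCH_CHECKOUT:-$PWD}"

# Write a sample configuration; merge its properties into conf/nutch-site.xml by hand.
cat > nutch-site.sample.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchSpider,*</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>${NUTCH_CHECKOUT}/build/plugins</value>
  </property>
</configuration>
EOF
```

Run it from the root of the checkout, or set NUTCH_CHECKOUT beforehand, so that plugin.folders ends up as an absolute path.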

Load project in Eclipse

  1. In Eclipse, click on “File” -> “Import...”

  2. Select “Existing Projects into Workspace”

    importproject.png

  3. In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click “Finish”.
  4. You will now see a new project named 2.x (or trunk) added to the workspace. Wait a moment until Eclipse refreshes its SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.

    build_workspace.png

  5. In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”

    build_path.png

  6. In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click the “Top” button. Eclipse will build the workspace again, but this time it won’t take long.

    order_and_export.png

Create Eclipse launcher

Now let's get ready to run something, starting with the inject operation. Right click on the project in “Package Explorer” -> select “Run As” -> select “Run Configurations”. Create a new configuration, name it "inject", and set its main class to the inject class for your Nutch version (see the table of classes further down).

run_configs1.png

In the “Arguments” tab, set the program arguments to the path of the input directory that holds the seed URLs. Set VM arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”

run_config_2.png
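
If you do not have a seed directory yet, one can be created as in this minimal sketch. The directory name "urls", the file name "seed.txt" and the example URL are arbitrary choices for illustration. My understanding is that the Nutch 1.x Injector additionally expects a crawldb path before the URL directory; running the class without arguments should print its usage and confirm this.

```shell
# Create a seed directory holding one URL per line.
# "urls" and "seed.txt" are names chosen for this sketch, not required by Nutch.
mkdir -p urls
printf '%s\n' "http://nutch.apache.org/" > urls/seed.txt
```

With the directory created in the project root, the program argument is simply `urls`, because a launch configuration uses the project directory as its working directory unless you change it.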

Click "Apply" and then click "Run". If everything is set up correctly, you should see the inject operation progressing in the console.

inject_console.png
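
With the VM arguments given above, the log should end up in logs/hadoop.log relative to the working directory of the launch configuration. The sketch below prints the most recent lines of that file. The helper name show_recent_log is invented for this example.

```shell
# Print the last N lines (default 50) of a log file.
# show_recent_log is a hypothetical helper, not part of Nutch.
show_recent_log() {
  tail -n "${2:-50}" "$1"
}

# Path matches -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
log="logs/hadoop.log"
if [ -f "$log" ]; then
  show_recent_log "$log"
else
  echo "no log yet at $log" >&2
fi
```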

To find the Java class corresponding to any command, look inside the "src/bin/nutch" script: near the bottom there is a switch-like block with one case per command. Here are the important classes corresponding to the crawl cycle:

Operation   Class in Nutch 1.x (i.e. trunk)        Class in Nutch 2.x
inject      org.apache.nutch.crawl.Injector        org.apache.nutch.crawl.InjectorJob
generate    org.apache.nutch.crawl.Generator       org.apache.nutch.crawl.GeneratorJob
fetch       org.apache.nutch.fetcher.Fetcher       org.apache.nutch.fetcher.FetcherJob
parse       org.apache.nutch.parse.ParseSegment    org.apache.nutch.parse.ParserJob
updatedb    org.apache.nutch.crawl.CrawlDb         org.apache.nutch.crawl.DbUpdaterJob
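
For a command that is not listed above, the lookup in "src/bin/nutch" can be done with grep, as in the sketch below. It assumes the script assigns the class on lines containing `CLASS=`, which is how I recall the script being written, so adjust the pattern if your version differs. The helper name list_nutch_classes is invented for this example.

```shell
# Print every line of the launcher script that assigns a class, with line numbers.
# Assumption: the script sets the class via lines containing "CLASS=".
list_nutch_classes() {
  grep -n 'CLASS=' "$1"
}

# Path relative to the root of the Nutch checkout.
script="src/bin/nutch"
if [ -f "$script" ]; then
  list_nutch_classes "$script"
else
  echo "not found: $script (run this from the root of the checkout)" >&2
fi
```

The line just above each match names the command that the class belongs to.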

Debug Nutch in Eclipse

The following locations are useful places to set breakpoints. Line numbers refer to the source at the time of writing and may differ in your checkout.

Nutch 1.x (i.e. trunk):

Fetcher [line: 1115] - run
Fetcher [line: 530] - fetch
Fetcher$FetcherThread [line: 560] - run()
Generator [line: 443] - generate
Generator$Selector [line: 108] - map
OutlinkExtractor [line: 71 & 74] - getOutlinks

Nutch 2.x:

FetcherReducer$FetcherThread run() : line 487 : LOG.info("fetching " + fit.url ....
                                   : line 519 : final ProtocolStatus status = output.getStatus();

GeneratorMapper : map() : line 53
GeneratorReducer : reduce() : line 53
OutlinkExtractor : getOutlinks() : line 84

Remote Debugging in Eclipse (NOT VERIFIED)

  1. Create a new Debug Configuration of type Remote Java Application and note the port (here: 37649).

  2. Launch Nutch from the command line, adding options that load the Java debugger's JDWP agent library, e.g. from bash:

% export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:37649"
% $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/
  3. The application will be suspended just after launch.
  4. Go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.

With this approach you do not need a separate launch configuration for every tool you want to debug: a single remote configuration is enough for any tool (parsechecker, indexchecker, URL filter, etc.), and it also works remotely (crawler or tool running on a server, Eclipse debugger running locally).

Debugging and Timeouts

Debugging takes time, especially when inspecting variables, stack traces, etc. It often takes long enough that a timeout fires and stops the application. Set the timeouts in the nutch-site.xml used for debugging to a rather high value (or -1 for unlimited), e.g. when debugging the parser:

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>

Troubleshooting

eclipse: Cannot create project content in workspace

The Nutch source code must be located outside the workspace folder. Alternatively, check out the code with Eclipse (SVN) directly into your workspace instead of creating the project from existing code; Eclipse sometimes refuses to create a project from source code that already sits inside the workspace.

Plugin directory not found

Make sure the plugin.folders property is set correctly. Instead of a relative path you can use an absolute one, either in nutch-default.xml or, better, in nutch-site.xml. Ideally nutch-default.xml should be left completely intact.

<property>
  <name>plugin.folders</name>
  <value>/home/....../trunk/src/plugin</value>
</property>
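
To check that the configured directory really contains plugins, something like the following sketch may help. It assumes that every plugin lives in its own subdirectory with a plugin.xml descriptor, which matches the Nutch layouts I have seen. The helper name count_plugins and the default path build/plugins are choices made for this example.

```shell
# Count plugin.xml descriptors exactly one level below the given plugin folder.
# Assumption: each plugin has its own subdirectory containing plugin.xml.
count_plugins() {
  find "$1" -mindepth 2 -maxdepth 2 -name plugin.xml | wc -l | tr -d ' '
}

# Pass the value of plugin.folders as the first argument; defaults to build/plugins.
plugin_dir="${1:-build/plugins}"
if [ -d "$plugin_dir" ]; then
  echo "$(count_plugins "$plugin_dir") plugin descriptors under $plugin_dir"
else
  echo "missing directory: $plugin_dir" >&2
fi
```

A count of zero, or a missing directory, suggests that plugin.folders points at the wrong place or that the plugins have not been built yet.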

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

Debugging Hadoop classes

Fairly often it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily on the underlying Hadoop infrastructure. You can therefore check out the Hadoop sources into your Eclipse IDE and debug both together. You can:

Non-ported Plugins to 2.x

A few plugins have not yet been ported to the Nutch 2.x series. If you are following the above tutorial to build Nutch 2.x, please check Nutch2Plugins for more information.

RunNutchInEclipse (last edited 2013-06-27 12:12:00 by 76)