Differences between revisions 9 and 10
Revision 9 as of 2013-06-24 04:13:51
Size: 2681
Editor: TejasPatil
Comment: changed the link to HBase
Revision 10 as of 2013-07-24 21:14:57
Size: 3081
Comment:
Deletions are prefixed with "-". Additions are prefixed with "+".
Line 6: Line 6:
-  * Grab a distribution of Nutch 2.X from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]
-  * Install and configure HBase. You can get it [[http://archive.apache.org/dist/hbase/|here]] ('''N.B.''' Gora 0.2 uses HBase 0.90.4, however the setup is known to work with more recent versions of the HBase 0.90.x branch)
+  * Grab the latest distribution of Nutch 2.X from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]
+  * Install and configure HBase. You can get it [[http://archive.apache.org/dist/hbase/|here]] ('''N.B.''' Gora 0.3 uses HBase 0.90.4, however the setup is known to work with more recent versions of the HBase 0.90.x branch)
Line 23: Line 23:
-     <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
+     <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Line 43: Line 43:
+ '''N.B.''' The crawl command in the bin/nutch script is deprecated. You should use individual commands or alternatively use the bin/crawl script... which effectively chains together individual commands.
Line 47: Line 49:
- '''N.B.''' The process of using the other datastore implementations offered within Gora e.g. Apache Cassandra, Accumulo and Sql, can be achieved simply by tweaking the above settings prior to compiling the Nutch code.
+ '''N.B.''' The process of using the other datastore implementations offered within Gora e.g. Apache Cassandra, Accumulo, can be achieved simply by tweaking the above settings prior to compiling the Nutch code.
+
+ '''N.B.''' As of Apache Gora release 0.3, the gora-sql 0.1.1-incubating artifact is deprecated. The choice is to downgrade to Nutch 2.1 if you wish to use MySQL or HSQLDB as a Gora backend.

Nutch 2.0 Tutorial

This document describes how to get Nutch 2.0 to use HBase as a storage backend for Gora.

  • Grab the latest distribution of Nutch 2.X from http://www.apache.org/dyn/closer.cgi/nutch/

  • Install and configure HBase. You can get it from http://archive.apache.org/dist/hbase/ (N.B. Gora 0.3 uses HBase 0.90.4; however, the setup is known to work with more recent versions of the HBase 0.90.x branch)

  • Specify the Gora backend in nutch-site.xml

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
  • Ensure the HBase gora-hbase dependency is available in ivy/ivy.xml

    <!-- Uncomment this to use HBase as Gora backend. -->
    
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
  • Ensure that HBaseStore is set as the default datastore in gora.properties

    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  • N.B. It's probably worth setting all your usual configuration settings within nutch-site.xml etc. before progressing.

  • Compile Nutch by running ant runtime in $NUTCH_HOME

  • Make sure HBase is started and working properly as per the quick start tutorial here
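Before compiling, the three settings listed above can be sanity-checked from the shell. The following is only a minimal sketch: the function name check_hbase_backend is hypothetical, and it assumes a standard Nutch source checkout in which nutch-site.xml and gora.properties live under conf/ and the Ivy file is ivy/ivy.xml. It only looks for the expected strings, so it cannot tell whether the gora-hbase dependency is still inside an XML comment.

```shell
#!/bin/sh
# Sketch: verify that the HBase-related settings from this tutorial are present.
# Assumption: $1 is a Nutch source checkout containing conf/ and ivy/.
# check_hbase_backend is a hypothetical helper, not part of Nutch.

check_hbase_backend() {
    nutch_home="$1"
    status=0

    # nutch-site.xml should name HBaseStore as storage.data.store.class
    if ! grep -q 'org\.apache\.gora\.hbase\.store\.HBaseStore' \
            "$nutch_home/conf/nutch-site.xml" 2>/dev/null; then
        echo "MISSING: HBaseStore in conf/nutch-site.xml"
        status=1
    fi

    # gora.properties should set HBaseStore as the default datastore
    if ! grep -q '^[[:space:]]*gora\.datastore\.default=org\.apache\.gora\.hbase\.store\.HBaseStore' \
            "$nutch_home/conf/gora.properties" 2>/dev/null; then
        echo "MISSING: gora.datastore.default in conf/gora.properties"
        status=1
    fi

    # ivy.xml should mention the gora-hbase dependency
    # (this does not detect a dependency that is still commented out)
    if ! grep -q 'name="gora-hbase"' "$nutch_home/ivy/ivy.xml" 2>/dev/null; then
        echo "MISSING: gora-hbase dependency in ivy/ivy.xml"
        status=1
    fi

    return $status
}
```

It could be called as check_hbase_backend "$NUTCH_HOME"; a zero exit status means all three strings were found.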

You should then be able to use Nutch. Change to $NUTCH_HOME/runtime/local/bin and run:

  ./nutch inject /someseedDir
  ./nutch readdb

N.B. The crawl command in the bin/nutch script is deprecated. You should use the individual commands, or alternatively the bin/crawl script, which chains the individual commands together.
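As a rough illustration of what running the individual commands in sequence can look like, here is a minimal sketch of a single crawl pass. The function name crawl_once, the NUTCH variable and the -topN value of 50 are assumptions made for this example, and the arguments accepted by each command vary between Nutch 2.x releases, so check the usage text printed by each command before relying on it.

```shell
#!/bin/sh
# Sketch of one crawl pass built from individual bin/nutch commands.
# Assumptions: run from $NUTCH_HOME/runtime/local/bin; the flags below match
# your Nutch 2.x release (run each command without arguments to see its usage).
# NUTCH is the launcher to call; set NUTCH=echo for a dry run that only prints.
NUTCH="${NUTCH:-./nutch}"

# crawl_once is a hypothetical helper, not part of Nutch.
crawl_once() {
    seed_dir="$1"
    "$NUTCH" inject "$seed_dir" &&      # load the seed URLs
    "$NUTCH" generate -topN 50 &&       # select a batch of URLs to fetch
    "$NUTCH" fetch -all &&              # fetch the generated batch
    "$NUTCH" parse -all &&              # parse the fetched pages
    "$NUTCH" updatedb                   # some releases expect a batch id or -all here
}
```

Each step only runs if the previous one succeeded, so a failure stops the pass and leaves the details in the log.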

More details are available in the log at $NUTCH_HOME/runtime/local/logs/hadoop.log.

N.B. You may encounter the following exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration. This happens because the HBase TEST jar is sometimes deployed in the lib dir. To resolve it, copy the HBase library from your installed HBase dir into the build lib dir. (A fix for this issue is currently in progress.)

N.B. To use one of the other datastore implementations offered within Gora, e.g. Apache Cassandra or Accumulo, tweak the above settings accordingly before compiling the Nutch code.

N.B. As of Apache Gora release 0.3, the gora-sql 0.1.1-incubating artifact is deprecated. If you wish to use MySQL or HSQLDB as a Gora backend, the option is to downgrade to Nutch 2.1.

For more details of the command line interface options, please see here, or run ./bin/nutch, which prints its usage to stdout. Finally, for a more detailed Nutch (1.X) tutorial, please see here

Nutch2Tutorial (last edited 2013-07-24 21:14:57 by LewisJohnMcgibbney)