Setting up Nutch 2.x with Cassandra

One of the novelties in Nutch 2.x is the use of Apache Gora as a back-end, which provides an in-memory data model and persistence for big data. Gora allows Nutch to connect to different storage options, such as the key/value store Apache Accumulo, the distributed big data store Apache HBase and the column family data store Apache Cassandra. Setting up Nutch with HBase as a back-end is explained in Nutch2Tutorial.

In this tutorial, however, we explain how to run Nutch 2.x using Cassandra.

Step 1: Setting up Cassandra

The version used here is: apache-cassandra-1.2.8-bin.tar.gz

Specific guidance on installing Cassandra can be found here.

Once installed, you should test the installation by starting Cassandra from the console using the following command:
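For a tarball installation such as the one named above, a minimal sketch of this start-up step might look as follows. It assumes that CASSANDRA_HOME (a variable introduced here for illustration) points at the directory the archive was extracted into, and it wraps the call in a hypothetical helper function, start_cassandra, so that a wrong path is reported instead of failing silently.

```shell
# Assumption: CASSANDRA_HOME is the directory the tarball was extracted into.
CASSANDRA_HOME="${CASSANDRA_HOME:-$HOME/apache-cassandra-1.2.8}"

# Hypothetical helper, not part of Cassandra itself.
start_cassandra() {
    if [ ! -x "$CASSANDRA_HOME/bin/cassandra" ]; then
        echo "No Cassandra launcher found under $CASSANDRA_HOME" >&2
        return 1
    fi
    # -f runs Cassandra in the foreground, so start-up messages and
    # errors are printed directly to the console.
    sudo "$CASSANDRA_HOME/bin/cassandra" -f
}
```

Calling start_cassandra then launches the server in the current console; because it runs in the foreground, Ctrl-C stops it again.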

(take care to use 'sudo' unless your user has write permission to the directories Cassandra uses)

Note: To get access to the Cassandra tables etc., you can additionally start the Cassandra client by running:

./bin/cassandra-cli -host localhost -port 9160

This should then connect to the 'Test Cluster' and print the following to the console:

"Connected to: "Test Cluster" on localhost/9160

Welcome to Cassandra CLI version 1.2.8 ..."

Further, entering ? at the client prompt lists the available command-line options.

Step 2: Setting up Nutch 2.x

A recent source version of Nutch 2 can be downloaded from here.

It then has to be compiled by running 'ant runtime' from the root of the installation folder.
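The build step can be sketched as a small shell function. The function name build_nutch and the path passed to it are assumptions made for illustration; the only thing taken from the text above is that 'ant runtime' must be run from the root of the Nutch source tree, which is where build.xml lives.

```shell
# Hypothetical helper: build Nutch from the given source root.
build_nutch() {
    nutch_root="$1"
    if [ ! -f "$nutch_root/build.xml" ]; then
        echo "No build.xml found in $nutch_root" >&2
        return 1
    fi
    # Run the build in a subshell so the caller's working directory is unchanged.
    (cd "$nutch_root" && ant runtime)
}
```

For example, build_nutch "$HOME/Nutch-2.x" (a hypothetical location) should, on success, leave ready-to-run installations under runtime/local and runtime/deploy.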

Crawling in Nutch 2.x

Setting up a basic crawl remains the same as in Nutch 1.x, except that you need to start Cassandra (and, if you want to inspect the tables, the Cassandra client) before starting your crawl.

For instructions on setting up and running a basic crawl, see NutchTutorial (the Nutch crawling tutorial for 1.x).

Using the crawl script, crawling can be started from Nutch-2.x/runtime/deploy/ by running:

bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
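As a rough illustration of the arguments, the sketch below prepares a seed directory and then calls the crawl script. All concrete values are assumptions chosen for the example: the directory name urls, the seed URL, the name TestCrawl, the Solr URL and the number of rounds. The seed directory is expected to contain plain-text files with one URL per line.

```shell
# Hypothetical values for illustration only.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"

# One seed URL per line; this URL is only a placeholder.
echo "http://nutch.apache.org/" > "$SEED_DIR/seed.txt"

# Start the crawl only if the crawl script is present in the current
# directory; Cassandra must already be running at this point.
if [ -x bin/crawl ]; then
    bin/crawl "$SEED_DIR" TestCrawl http://localhost:8983/solr/ 2
else
    echo "bin/crawl not found; run this from the Nutch runtime directory" >&2
fi
```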

Note: If Nutch 2.x has run successfully, it should have created a keyspace called 'webpage', which can be viewed in the Cassandra client started as described above by running: show keyspaces;

N.B.: If you want to start from scratch and make sure no old URLs are re-read from the table, you can remove the stored data from Cassandra through the client, e.g. deleting the 'webpage' keyspace by running: drop keyspace webpage;
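The two client statements mentioned above can also be run non-interactively. The sketch below writes them to a file and passes that file to the Cassandra client with its -f option. The file name reset_webpage.cli is an arbitrary choice, and the relative path ./bin/cassandra-cli assumes the script is run from the root of the Cassandra installation. Note that dropping the keyspace permanently deletes the crawl data stored in it.

```shell
# Statements for the Cassandra client; the file name is an arbitrary choice.
cat > reset_webpage.cli <<'EOF'
show keyspaces;
drop keyspace webpage;
EOF

# Run the statements only if the client is available in the current
# directory (assumption: we are in the root of the Cassandra installation).
if [ -x ./bin/cassandra-cli ]; then
    ./bin/cassandra-cli -host localhost -port 9160 -f reset_webpage.cli
else
    echo "cassandra-cli not found; statements left in reset_webpage.cli" >&2
fi
```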

Checking the results of your crawl (e.g. the number of URLs in the crawldb) is best done with the 'readdb' command of the bin/nutch script, e.g. to get the crawldb statistics: bin/nutch readdb <crawlDir> -stats

Nutch2Cassandra (last edited 2013-09-06 15:48:10 by 109)