Error Messages in Nutch 2.0

This page acts as a repository for error messages you might encounter while using Nutch 2.0. It is likely to change over time, given the variety of additional software projects that can be combined with Nutch 2.0 and the range of errors this presents.

Nutch 2.0 and Apache Cassandra

When trying to configure Nutch (running in distributed mode on Cloudera's CDH3) with Cassandra as the Gora storage mechanism, the following NoSuchMethodError results when attempting to inject the crawldb with a seed list.

Caused by: java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
        at org.apache.cassandra.thrift.CfDef.<clinit>(CfDef.java:299)
        at org.apache.cassandra.thrift.KsDef.read(KsDef.java:753)
        at org.apache.cassandra.thrift.Cassandra$describe_keyspace_result.read(Cassandra.java:24338)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_keyspace(Cassandra.java:1371)
        at org.apache.cassandra.thrift.Cassandra$Client.describe_keyspace(Cassandra.java:1346)
        at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:192)
        at me.prettyprint.cassandra.service.AbstractCluster$4.execute(AbstractCluster.java:187)
        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
        at me.prettyprint.cassandra.service.AbstractCluster.describeKeyspace(AbstractCluster.java:201)
        at org.apache.gora.cassandra.store.CassandraClient.checkKeyspace(CassandraClient.java:82)
        at org.apache.gora.cassandra.store.CassandraClient.init(CassandraClient.java:69)
        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:68)
        ... 18 more
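
For context, the configuration that selects Cassandra as the Gora backend typically looks like the following. This is a sketch: the property names below are the usual ones for Nutch 2.0 snapshots and gora-cassandra 0.2 of this period, but verify them against your own checkout.

<!-- nutch-site.xml: select the Cassandra-backed Gora store -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>

# gora.properties: Cassandra Thrift endpoint (key assumed from gora-cassandra 0.2)
gora.cassandrastore.servers=localhost:9160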

When using different Gora storage mechanisms, the Nutch Ivy configuration must be tweaked manually to match the chosen Gora store, in this case Cassandra.

To resolve this error the following was added to $NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
<dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
<dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
<dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
<dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
<dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>

Then the following Ant commands were executed:

$ ant clean
$ ant

This specified the correct dependencies to be downloaded by Ivy, which were then bundled into the nutch-2.0-dev.job file.
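
As a quick sanity check (not part of the original report), the job file is an ordinary zip archive, so you can confirm the Cassandra, Hector and Thrift jars were bundled into it. The exact location of the job file depends on your checkout, often under build/ or runtime/deploy/:

$ unzip -l nutch-2.0-dev.job | grep -E 'cassandra|hector|thrift'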

In this particular case Cloudera CDH3 was being used. CDH3 ships a Hue plugins jar containing an older Thrift library, so removing this jar from the classpath resolved further errors when running Nutch in distributed mode.

Correspondence on this error can be seen in context here.

Missing plugins whilst running Nutch 2.0 on Cloudera's CDH3

Cloudera's CDH3 is Cloudera's distribution including Apache Hadoop. More information can be found in the CDH3 Quick Start Guide (https://ccp.cloudera.com/display/CDHDOC/CDH3+Quick+Start+Guide). This common error results from MAPREDUCE-967, which changed the way MapReduce unpacks the job jar: previously the whole jar was unpacked, whereas now only classes/ and lib/ are, so Nutch is missing its plugins/ directory. A workaround is to force unpacking of the plugins/ directory. This can be done by adding the following properties to nutch-site.xml:

<property>
<name>mapreduce.job.jar.unpack.pattern</name>
<value>(?:classes/|lib/|plugins/).*</value>
</property>

<property>
<name>plugin.folders</name>
<value>${job.local.dir}/../jars/plugins</value>
</property>

and by removing hue-plugins-1.2.0-cdh3u1.jar from the Hadoop lib folder (e.g. /usr/lib/hadoop-0.20/lib).
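
For example (a hypothetical command; adjust the path to your installation), the jar can be moved out of the way rather than deleted outright:

$ sudo mv /usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar /tmp/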

It is then necessary to recreate the Nutch job file using Ant. Finally, it is important to set HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins" in hadoop-env.sh.
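
A sketch of these final steps follows. The hadoop-env.sh location shown is the usual CDH3 one but may differ on your system, and /<MY HOME> is a placeholder you must replace with your own path:

$ cd $NUTCH_HOME
$ ant clean
$ ant                # recreates the Nutch job file

# then append to /etc/hadoop/conf/hadoop-env.sh (path assumed for CDH3):
export HADOOP_OPTS="-Djob.local.dir=/<MY HOME>/nutch/plugins"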

Although this is a rather nasty workaround, it does work.
