The primary (core) Nutch configuration files are:

  • conf/nutch-default.xml: This file contains generic default settings for Nutch specific configuration properties.
  • conf/nutch-site.xml: This file contains site specific settings for Nutch specific configuration properties.
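
For example, a minimal conf/nutch-site.xml that overrides a single default from conf/nutch-default.xml might look like the following (http.agent.name is a standard Nutch property whose default value is empty; the value shown is illustrative):

  <?xml version="1.0"?>
  <configuration>
    <!-- Overrides the empty default defined in nutch-default.xml -->
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value>
    </property>
  </configuration>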

Hadoop configuration

N.B. The following files are Hadoop specific and are no longer packaged with Nutch:

  • hadoop-default.xml: This file contains generic default settings for all Hadoop daemons and Map/Reduce jobs.
  • core-site.xml: This file contains site specific settings for all Hadoop daemons and Map/Reduce jobs.
  • mapred-site.xml: This file contains site specific settings for the Hadoop Map/Reduce daemons and jobs.
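
For example, the site files on a cluster might contain entries such as the following (the host names and ports are illustrative, and the property names are the classic pre-0.20 Hadoop names):

  <!-- core-site.xml: where the default filesystem lives -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>

  <!-- mapred-site.xml: where the JobTracker lives -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>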

For more information on the Hadoop configuration files please see GettingStartedWithHadoop#Configuration_files.

Dennis Kubes explains:

Configuration has two levels, default and final. It is supplied by the org.apache.hadoop.conf.Configuration class and extended in Nutch by the org.apache.nutch.util.NutchConfiguration class.

Although it is configurable, by default hadoop-default.xml and nutch-default.xml are default resources, and hadoop-site.xml and nutch-site.xml are final resources. Resources (i.e. resource files) can be added by filename to either the default or the final resource set; in fact, this is how Nutch extends the Configuration class, by adding nutch-default.xml and nutch-site.xml.
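
To make the two levels concrete: if the same property appears in both a default and a final resource, the value from the final resource wins. For example (http.timeout is a standard Nutch property; the values are illustrative):

  <!-- nutch-default.xml (default resource) -->
  <property>
    <name>http.timeout</name>
    <value>10000</value>
  </property>

  <!-- nutch-site.xml (final resource): this value takes precedence -->
  <property>
    <name>http.timeout</name>
    <value>30000</value>
  </property>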

...

  • mapred-default.xml - (Hadoop specific) this is loaded as a default resource when a new map-reduce JobConf is created - which means that it is loaded as the last default resource when you prepare the job configuration. Usually you should keep its content to a bare minimum. This is the best place to specify the default number of map and reduce tasks per job (see the example after this list). If you feel adventurous you could also put some other stuff there, e.g. set the default compression with mapred.compress.map.output and so on.
  • job.xml - (Hadoop specific, deprecated) this file is created dynamically and represents a serialized JobConf. When map-reduce tasks are started they read this file as their last default resource (note - this is NOT a final resource!). So, if you accidentally distributed mapred-default.xml to all cluster nodes, but in your job you specified a different number of map or reduce tasks, your settings will take precedence. The same applies to other settings, e.g. the compression setting.
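
As an example of keeping mapred-default.xml to that bare minimum, a file like the following sets only per-job defaults that individual jobs may still override (the values are illustrative):

  <!-- mapred-default.xml: per-job defaults, overridable by job.xml -->
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>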

HOWEVER ... a common error is to put too many properties, such as the default number of map and reduce tasks, in hadoop-site.xml. As Dennis explained, this is a final resource - which means that the values you specify there will ALWAYS override your job settings. This is bad, so don't do it - put them in mapred-default.xml.

...