This page collects information that is useful for new developers. Some of it will eventually need to move to the Hadoop Wiki, but I am putting it here first while I assemble it. Please feel free to add, comment, and make corrections.


To new developers: if you want to start developing on Nutch, begin by looking at the Hadoop source code. Hadoop is the platform that Nutch is implemented on, so to understand how Nutch works you also need to understand Hadoop.

What are the Hadoop primitives and how do I use them? Why are they there (what functionality do they add over regular primitives)?

These primitives implement the Hadoop Writable interface (or WritableComparable), which gives Hadoop control over how these objects are serialized. If you look at the higher-level Hadoop file system objects such as ArrayFile, you will see that they are built around the same interfaces for serialization. Using these primitive types allows serialization to be done in the same way as for higher-order data structures such as MapFile.
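To make the serialization contract concrete, here is a minimal sketch of a wrapped int that writes and reads itself. It assumes nothing beyond the JDK: `WritableLike` is a local stand-in that declares the same two methods as Hadoop's Writable (`write(DataOutput)` and `readFields(DataInput)`), and `SimpleIntWritable` and `WritableSketch` are hypothetical names used only for illustration. In real code you would implement `org.apache.hadoop.io.Writable` itself or use one of the existing primitive wrappers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable,
// declared here only so the sketch compiles without Hadoop on the classpath.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical int wrapper, similar in spirit to a Hadoop primitive type.
class SimpleIntWritable implements WritableLike {
    private int value;

    SimpleIntWritable() { }                        // empty instance to be filled by readFields
    SimpleIntWritable(int value) { this.value = value; }

    int get() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);                       // the object decides its own byte layout
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();                      // must mirror write() exactly
    }
}

class WritableSketch {
    // Serialize any WritableLike to a byte array.
    static byte[] toBytes(WritableLike w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Write a value out and read it back into a fresh instance.
    static SimpleIntWritable roundTrip(SimpleIntWritable original) {
        SimpleIntWritable copy = new SimpleIntWritable();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(toBytes(original)))) {
            copy.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new SimpleIntWritable(42)).get());
    }
}
```

Because every type serializes itself through the same two methods, a container only needs to call `write` and `readFields` on its keys and values. It never has to know what they hold.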

How does the Hadoop implementation of MapReduce work?

  1. First you need a JobConf. This class contains all the relevant information for the job. Information that you need to make sure to include in the JobConf includes:

  2. Then you need to submit your job to Hadoop to be run. This is done by calling JobClient.runJob, which submits the job and handles receiving status updates back from it. It starts by creating an instance of JobClient and then pushes the job toward execution by calling JobClient.submitJob.

  3. JobClient.submitJob handles splitting the input files and generating the MapReduce tasks.
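The steps above can be sketched with a toy, in-memory model. This is not Hadoop code and does not use the Hadoop API: `ToyJobConf`, `splitInput`, `mapTask`, `reduceTask`, and `runJob` are hypothetical names, and splitting by a fixed number of lines is an assumption made to keep the example small. It only shows the shape of the flow: a configuration object describes the job, the input is cut into splits, each split feeds one map task, and the grouped map output is reduced.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, in-memory model of the job flow. None of these names are Hadoop classes.
class ToyMapReduce {

    // Stand-in for step 1: holds everything the job needs to run.
    static class ToyJobConf {
        final List<String> inputLines;
        final int linesPerSplit;

        ToyJobConf(List<String> inputLines, int linesPerSplit) {
            if (linesPerSplit <= 0) {
                throw new IllegalArgumentException("linesPerSplit must be positive");
            }
            this.inputLines = inputLines;
            this.linesPerSplit = linesPerSplit;
        }
    }

    // Stand-in for step 3: cut the input into splits, one per map task.
    static List<List<String>> splitInput(ToyJobConf conf) {
        List<List<String>> splits = new ArrayList<>();
        int total = conf.inputLines.size();
        for (int start = 0; start < total; start += conf.linesPerSplit) {
            int end = Math.min(start + conf.linesPerSplit, total);
            splits.add(new ArrayList<>(conf.inputLines.subList(start, end)));
        }
        return splits;
    }

    // One map task: emit a (word, 1) pair for every word in its split.
    static List<Map.Entry<String, Integer>> mapTask(List<String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : split) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // One reduce call: sum the values collected for a single key.
    static int reduceTask(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for step 2: run every task in order and return the final output.
    static Map<String, Integer> runJob(ToyJobConf conf) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> split : splitInput(conf)) {
            for (Map.Entry<String, Integer> pair : mapTask(split)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reduceTask(entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        ToyJobConf conf = new ToyJobConf(Arrays.asList("a b a", "b c", "a"), 2);
        System.out.println(runJob(conf));
    }
}
```

In Hadoop itself the tasks are distributed across machines and the client polls for status, while this sketch runs everything sequentially in one process. Read it as a picture of the data flow, not as a description of how JobClient is implemented.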