Note: for the 1.0.x series of Hadoop, the following articles will probably be easiest to follow:

The instructions below are primarily for the 0.2x series of Hadoop.

Downloading and installing Hadoop

Hadoop can be downloaded from one of the Apache download mirrors. You may also download a nightly build or check out the code from Subversion and build it with Ant. Select a directory to install Hadoop under (let's say /foo/bar/hadoop-install) and untar the tarball in that directory. A directory corresponding to the version of Hadoop downloaded will be created under the /foo/bar/hadoop-install directory. For instance, if version 0.21.0 of Hadoop was downloaded, untarring as described above will create the directory /foo/bar/hadoop-install/hadoop-0.21.0.

The examples in this document assume the existence of an environment variable $HADOOP_INSTALL that represents the path under which all versions of Hadoop are installed. In the above instance, HADOOP_INSTALL=/foo/bar/hadoop-install. They further assume the existence of a symlink named hadoop in $HADOOP_INSTALL that points to the version of Hadoop being used. For instance, if version 0.21.0 is being used, then $HADOOP_INSTALL/hadoop -> hadoop-0.21.0. All tools used to run Hadoop are in the directory $HADOOP_INSTALL/hadoop/bin, and all configuration files for Hadoop are in the directory $HADOOP_INSTALL/hadoop/conf.
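A minimal sketch of this layout, assuming the tarball has already been downloaded to the current directory (the exact tarball name depends on the release you chose):

% mkdir -p /foo/bar/hadoop-install
% tar xzf hadoop-0.21.0.tar.gz -C /foo/bar/hadoop-install
% ln -s hadoop-0.21.0 /foo/bar/hadoop-install/hadoop
% export HADOOP_INSTALL=/foo/bar/hadoop-install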

Startup scripts

The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch Hadoop DFS and Hadoop Map/Reduce daemons. These are:

- start-all.sh - Starts all Hadoop daemons: the namenode, datanodes, the jobtracker, and tasktrackers.
- stop-all.sh - Stops all Hadoop daemons.
- start-dfs.sh - Starts the Hadoop DFS daemons: the namenode and datanodes.
- stop-dfs.sh - Stops the Hadoop DFS daemons.
- start-mapred.sh - Starts the Hadoop Map/Reduce daemons: the jobtracker and tasktrackers.
- stop-mapred.sh - Stops the Hadoop Map/Reduce daemons.

It is also possible to run the Hadoop daemons as Windows Services using the Java Service Wrapper (downloaded separately). This still requires Cygwin to be installed, as Hadoop requires its df command. See the following JIRA issues for details:

Configuration files

Hadoop Cluster Setup/Configuration contains a description of Hadoop configuration for 0.21.0. The information on this wiki page is not current; see also QuickStart, which is current for 0.21.0.

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop. These are:

- hadoop-env.sh - Environment variables used by the Hadoop scripts, such as JAVA_HOME.
- hadoop-site.xml - Site-specific configuration overrides (in 0.21 this file is split into core-site.xml, hdfs-site.xml, and mapred-site.xml).
- masters - The list of hosts on which to run secondary namenodes.
- slaves - The list of hosts on which to run datanodes and tasktrackers.
- log4j.properties - Logging configuration for the Hadoop daemons.

More details on configuration can be found on the HowToConfigure page.
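Whatever else you configure, most installations need at least JAVA_HOME set in conf/hadoop-env.sh. A minimal sketch of the relevant line (the JDK path shown is an assumption; point it at your own Java installation):

export JAVA_HOME=/usr/lib/jvm/java-6-sun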

Setting up Hadoop on a single node

This section describes how to get started by setting up a Hadoop cluster on a single node. The setup described here is an HDFS instance with a namenode and a single datanode and a Map/Reduce cluster with a jobtracker and a single tasktracker. The configuration procedures described in Basic Configuration are just as applicable for larger clusters.

Basic Configuration

Take a pass at putting together basic configuration settings for your cluster. Some of the settings that follow are required; others are recommended for more straightforward and predictable operation.

An example of a hadoop-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
</configuration>

Formatting the Namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will erase all your data. Before formatting, ensure that the dfs.name.dir directory exists. If you just used the default, then
% mkdir -p /tmp/hadoop-username/dfs/name
will create the directory. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format

If asked to [re]format, you must reply Y (not just y) if you want to reformat; otherwise Hadoop will abort the format.

Starting a single-node cluster

Run the command:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh
This will start up a namenode, a datanode, a jobtracker, and a tasktracker on your machine.
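Once the daemons are up, you can check on them with the JDK's jps tool; expect to see NameNode, DataNode, JobTracker, and TaskTracker (and, depending on the release, SecondaryNameNode) among the listed Java processes:

% jps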

Stopping a single-node cluster

Run the command
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.

Separating Configuration from Installation

In the example described above, the configuration files used by the Hadoop cluster all lie within the Hadoop installation. This can become cumbersome when upgrading to a new release, since all custom configuration has to be re-created in the new installation. It is possible to separate the configuration from the installation. To do so, select a directory to house the Hadoop configuration (let's say /foo/bar/hadoop-config) and copy all the configuration files to it. You can either set the HADOOP_CONF_DIR environment variable to refer to this directory or pass it directly to the Hadoop scripts with the --config option. In this case, the cluster start and stop commands specified in the above two sub-sections become:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh --config /foo/bar/hadoop-config
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh --config /foo/bar/hadoop-config
Only the absolute path to the config directory should be passed to the scripts.
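If you prefer the environment-variable route, a minimal sketch, assuming the example paths used throughout this page:

% mkdir /foo/bar/hadoop-config
% cp $HADOOP_INSTALL/hadoop/conf/* /foo/bar/hadoop-config
% export HADOOP_CONF_DIR=/foo/bar/hadoop-config
% $HADOOP_INSTALL/hadoop/bin/start-all.sh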

Starting up a larger cluster
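On a multi-node cluster the procedure is the same in outline: format the namenode once, then start the DFS daemons before the Map/Reduce daemons. A minimal sketch, assuming Hadoop is installed at the same path on every node, conf/slaves lists the worker hosts, and passwordless ssh is set up from the master nodes to each worker:

On the namenode host (starts the namenode locally and a datanode on each host listed in conf/slaves):
% $HADOOP_INSTALL/hadoop/bin/start-dfs.sh

On the jobtracker host (starts the jobtracker locally and a tasktracker on each host listed in conf/slaves):
% $HADOOP_INSTALL/hadoop/bin/start-mapred.sh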

Stopping the cluster
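A minimal sketch, mirroring the startup order in reverse:

On the jobtracker host:
% $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh

On the namenode host:
% $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh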