QuickStart

Get up and running fast

The fastest way may be to just install the Hadoop virtual image. If you want something actually running on your own system that is easier to work on and can later be expanded to other boxes, stick with this document.

This page is based on the docs found at the following link, but modified to work with the current distribution: http://hadoop.apache.org/core/api/overview-summary.html#overview_description

Please note this was last updated to match svn version 605291. Things may have changed since then. If they have, please update this page.

Requirements

Preparatory Steps

Download

Release versions can be found here: http://hadoop.apache.org/core/releases.html

Subversion: first check that the current build isn't broken: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/

Then grab the latest with Subversion:

svn co http://svn.apache.org/repos/asf/hadoop/core/trunk hadoop

Then run the following commands:

cd hadoop
ant 
ant examples
bin/hadoop

bin/hadoop should display the basic command-line help and let you know it's at least basically working. If any of the above steps failed, use Subversion to roll back to an earlier day's revision.
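One way to do that rollback, sketched as a small shell function. It assumes you are in the hadoop checkout directory and that svn and ant are on your PATH. rollback_and_rebuild is a made-up helper name for this page, not something shipped with Hadoop.

```shell
# Update the working copy to the given Subversion revision, then rebuild.
# Run this from the top of the checkout (the "hadoop" directory).
# rollback_and_rebuild is a hypothetical helper, not part of Hadoop.
rollback_and_rebuild() {
    revision="$1"
    # Each step runs only if the previous one succeeded.
    svn update -r "$revision" && ant && ant examples
}
```

For example, rollback_and_rebuild 605291 would return to the revision this page was last checked against. Subversion also accepts a date in braces, such as '{2008-01-15}' (an arbitrary example date), which selects the most recent revision as of the start of that day.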

Stage 1: Standalone Operation

By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:

mkdir input
cp conf/*.xml input
bin/hadoop jar build/hadoop-0.16.0-dev-examples.jar grep input output 'dfs[a-z.]+'
cat output/*

The version number on the jar may have changed by the time you read this. You should see a lot of INFO-level log messages go by when you run it, and cat output/* should give you something that looks like this:

cat output/*
2       dfs.
1       dfs.block.size
1       dfs.blockreport.interval
1       dfs.client.block.write.retries
1       dfs.client.buffer.dir
1       dfs.data.dir
1       dfs.datanode.bind
1       dfs.datanode.dns.interface
1       dfs.datanode.dns.nameserver
1       dfs.datanode.du.pct
1       dfs.datanode.du.reserved
1       dfs.datanode.port
...(and so on)

If you saw the error Exception in thread "main" java.lang.NoClassDefFoundError: build/hadoop-0/16/0-dev-examples/jar, it means you forgot to type jar after bin/hadoop. If you were unable to run this example, roll back to a previous night's version. If it seemed to run fine but cat didn't print anything, you probably mistyped something; try copying the command directly from the wiki to avoid typos. You'll need to wipe out the output directory between runs.
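Since the jar's version number changes and the output directory has to be wiped between runs, a wrapper along these lines may save some retyping. It is only a sketch: find_examples_jar and run_grep_example are hypothetical helper names, and the hadoop-*-examples.jar pattern is an assumption based on the jar name shown above.

```shell
# Print the path of the first examples jar found in the given build
# directory (default: build). Returns non-zero if none is found.
# The hadoop-*-examples.jar pattern is an assumption, not a guarantee.
find_examples_jar() {
    build_dir="${1:-build}"
    for jar in "$build_dir"/hadoop-*-examples.jar; do
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    return 1
}

# Wipe the previous output directory and re-run the grep example.
# Run from the top of the hadoop checkout.
run_grep_example() {
    jar_path=$(find_examples_jar build) || return 1
    rm -rf output
    bin/hadoop jar "$jar_path" grep input output 'dfs[a-z.]+'
}
```

After defining both functions, run_grep_example followed by cat output/* repeats the steps above without hard-coding the version.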

Congratulations, you have just successfully run your first MapReduce job with Hadoop.

Stage 2: Pseudo-distributed Configuration

You can in fact run everything on a single host. To run things this way, put the following in conf/hadoop-site.xml:

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
        <!-- set to 1 to reduce warnings when 
        running on a single node -->
  </property>

</configuration>

Now check that the command ssh localhost does not require a password. If it does, execute the following commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now, try ssh localhost again. If this doesn't work, you're going to have to figure out what's going on with your ssh-agent on your own.
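A couple of things that might help with that debugging, sketched as shell functions. Both names are hypothetical helpers. The first uses ssh's BatchMode option so that ssh fails instead of prompting. The second tightens permissions, because sshd with its default StrictModes setting ignores keys whose files are writable by others. Also note that recent OpenSSH releases disable DSA keys by default, so an RSA or Ed25519 key may be needed in place of the DSA key generated above.

```shell
# Return 0 if "ssh <host>" works without asking for a password.
# BatchMode=yes makes ssh fail rather than prompt interactively.
# (Hypothetical helper name, for illustration only.)
can_ssh_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Restrict ~/.ssh and authorized_keys to the owner only.
# (Hypothetical helper name, for illustration only.)
tighten_ssh_permissions() {
    chmod 700 "$HOME/.ssh" && chmod 600 "$HOME/.ssh/authorized_keys"
}
```

Something like can_ssh_without_password localhost && echo "passwordless ssh works" gives a quick yes or no. The check also fails if the host key has never been accepted, so connect once by hand first.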

Windows users: To start the ssh server, run "ssh-host-config -y" in the Cygwin environment. If it asks for the CYGWIN environment value, set it to "ntsec tty". You can then start the server from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".

Mac users: You'll probably need to install something like SSHKeychain or SSHChain (no idea which is better) to be able to ssh to a computer without entering the password every time. This is because ssh-agent was designed for X11 systems, and OS X isn't an X11 system.

Bootstrapping

A new distributed filesystem must be formatted with the following command, run on the master node:

bin/hadoop namenode -format

You should see a quick series of STARTUP_MSGs followed by a SHUTDOWN_MSG.

Open the conf/hadoop-env.sh file and define JAVA_HOME in it. Then start the Hadoop daemons with:

bin/start-all.sh

It should notify you that it's starting the namenode, datanode, secondarynamenode, jobtracker, and tasktracker.
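To double-check that the daemons really stayed up, the JDK's jps tool lists running Java processes by class name. The sketch below reads jps output on standard input and prints any expected daemon that is missing. missing_daemons is a hypothetical helper, and the class names in the list are assumed to be the names jps reports for these daemons.

```shell
# Read "jps" output (lines of "<pid> <ClassName>") on stdin and print
# the name of each expected Hadoop daemon that is not running.
# (Hypothetical helper; the daemon class names are an assumption.)
missing_daemons() {
    running=$(awk '{print $2}')
    for name in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$running" | grep -qx "$name" || printf '%s\n' "$name"
    done
}
```

Running jps | missing_daemons should print nothing when everything is up. If a name is printed, the matching file in the logs directory is the place to look.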

Input files are copied into the distributed filesystem as follows:

bin/hadoop dfs -put <localsrc> <dst>

For more details, just type bin/hadoop dfs with no options.
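As a concrete sketch, the helper below copies a local path into the distributed filesystem and then lists the destination to confirm the copy. dfs_put_and_list and the HADOOP variable are hypothetical names introduced here. The -put and -ls subcommands are the ones printed by bin/hadoop dfs.

```shell
# Path to the hadoop launcher; override HADOOP when running from elsewhere.
HADOOP="${HADOOP:-bin/hadoop}"

# Copy a local file or directory into the distributed filesystem, then
# list the destination to confirm it arrived.
# (Hypothetical helper name, for illustration only.)
dfs_put_and_list() {
    "$HADOOP" dfs -put "$1" "$2" && "$HADOOP" dfs -ls "$2"
}
```

For instance, dfs_put_and_list input input would upload the input directory created in Stage 1. Re-running the grep example should then read from the distributed filesystem, and bin/hadoop dfs -cat 'output/*' can be used to view the result there.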

Stage 3: Fully-distributed operation

Distributed operation is just like the pseudo-distributed operation described above, except:

  1. Specify the hostname or IP address of the master server in the values for fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. These are specified as host:port pairs.

  2. Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.

  3. Specify mapred.local.dir in conf/hadoop-site.xml. This determines where temporary MapReduce data is written. It also may be a list of directories.

  4. Specify mapred.map.tasks and mapred.reduce.tasks in conf/mapred-default.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.

  5. List all slave hostnames or IP addresses in your conf/slaves file, one per line.
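The rule of thumb in step 4 is simple arithmetic, sketched here as a shell function. suggest_task_counts is a hypothetical helper, and it assumes every slave has the same number of processors.

```shell
# Print suggested mapred.map.tasks and mapred.reduce.tasks values using
# the 10x / 2x rule of thumb from step 4.
# Arguments: <number of slaves> <processors per slave>
# (Hypothetical helper; assumes identical slave machines.)
suggest_task_counts() {
    slaves="$1"
    processors_per_slave="$2"
    total=$((slaves * processors_per_slave))
    echo "mapred.map.tasks=$((total * 10))"
    echo "mapred.reduce.tasks=$((total * 2))"
}
```

With 4 slaves of 2 processors each, suggest_task_counts 4 2 prints mapred.map.tasks=80 and mapred.reduce.tasks=16. Treat these as starting points to tune, not fixed values.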

last edited 2008-04-02 18:51:55 by RomanLakotko