Get up and running fast
The fastest way may be to just install the Hadoop virtual image. But if you want something actually running on your own system, which is easier to work on and can later be expanded to other boxes, stick with this document.
Based on the docs found at the following link, but modified to work with the current distribution:
http://hadoop.apache.org/core/api/overview-summary.html#overview_description
Please note this was last updated to match svn version 605291. Things may have changed since then. If they have, please update this page.
Requirements
Java 1.5.X
ssh and sshd
rsync
Preparatory Steps
Download
Release versions can be found here:
http://hadoop.apache.org/core/releases.html
Subversion: First check that the current build isn't borked:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/
Then grab the latest with Subversion:
svn co http://svn.apache.org/repos/asf/hadoop/core/trunk hadoop
Run the following commands:
cd hadoop
ant
ant examples
bin/hadoop
bin/hadoop should display the basic command-line help and let you know it's at least basically working. If any of the above steps failed, use Subversion to roll back to an earlier day's revision.
Stage 1: Standalone Operation
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
cp conf/*.xml input
bin/hadoop jar build/hadoop-0.16.0-dev-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
The version number on the jar may have changed by the time you read this. You should see a lot of INFO-level log messages go by when you run it, and cat output/* should give you something that looks like this:
cat output/*
2 dfs.
1 dfs.block.size
1 dfs.blockreport.interval
1 dfs.client.block.write.retries
1 dfs.client.buffer.dir
1 dfs.data.dir
1 dfs.datanode.bind
1 dfs.datanode.dns.interface
1 dfs.datanode.dns.nameserver
1 dfs.datanode.du.pct
1 dfs.datanode.du.reserved
1 dfs.datanode.port
...(and so on)
If you saw the error Exception in thread "main" java.lang.NoClassDefFoundError: build/hadoop-0/16/0-dev-examples/jar, it means you forgot to type jar after bin/hadoop.
If you were unable to run this example, roll back to a previous night's version. If it seemed to run fine but cat didn't print anything, you probably mistyped something. Try copying the command directly from the wiki to avoid typos. You'll need to wipe out the output directory between runs.
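Wiping out the output directory by hand is easy to forget, so it can be wrapped in a small helper. This is only a sketch and not part of Hadoop: the function name clean_output is made up for illustration, and it assumes the output directory is a relative path such as the output directory used above.

```shell
# Hypothetical helper, not part of Hadoop: remove the previous run's
# output directory so the example can be run again.
clean_output() {
  dir="${1:-output}"
  # Refuse absolute paths and anything containing "..", to avoid
  # deleting something outside the working directory by accident.
  case "$dir" in
    /*|*..*) echo "refusing to remove '$dir'" >&2; return 1 ;;
  esac
  rm -rf -- "$dir"
}
```

A re-run would then look like: clean_output output && bin/hadoop jar build/hadoop-0.16.0-dev-examples.jar grep input output 'dfs[a-z.]+'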
Congratulations, you have just successfully run your first MapReduce job with Hadoop.
Stage 2: Pseudo-distributed Configuration
You can in fact run everything on a single host. To run things this way, put the following in conf/hadoop-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<!-- set to 1 to reduce warnings when
running on a single node -->
</property>
</configuration>
Now check that the command ssh localhost does not require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now, try ssh localhost again. If this doesn't work, you're going to have to figure out what's going on with your ssh-agent on your own.
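To test the login without being dropped into an interactive prompt, ssh can be told to fail rather than ask for a password. The sketch below is an illustration only: the function name check_passwordless_ssh and the SSH override variable are hypothetical, while BatchMode is a standard OpenSSH client option.

```shell
# Hypothetical helper: succeed only if ssh can log in without prompting.
# SSH may be overridden (for example in tests); it defaults to the real ssh client.
check_passwordless_ssh() {
  host="${1:-localhost}"
  # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
  if "${SSH:-ssh}" -o BatchMode=yes "$host" true; then
    echo "passwordless ssh to $host works"
  else
    echo "ssh to $host still needs a password, or the connection failed" >&2
    return 1
  fi
}
```

Run it as check_passwordless_ssh localhost. If it fails even after the keys are in place, one possible cause (an assumption about your setup) is that ~/.ssh or ~/.ssh/authorized_keys is writable by other users, which sshd rejects by default; chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys fix that.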
Windows Users: To start the ssh server, you need to run "ssh-host-config -y" in the Cygwin environment. If it asks for the CYGWIN environment value, set it to "ntsec tty". After that, you can start the server from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".
Mac Users: You'll probably need to install something like SSHKeychain or SSHChain (no idea which is better) to be able to ssh to a computer without having to enter the password every time. This is because ssh-agent was designed for X11 systems and OS X isn't an X11 system.
Bootstrapping
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
You should see a quick series of STARTUP_MSGs and a SHUTDOWN_MSG.
Open the conf/hadoop-env.sh file and define JAVA_HOME in it. Then start up the Hadoop daemons with:
bin/start-all.sh
It should notify you that it's starting the namenode, datanode, secondarynamenode, jobtracker, and tasktracker.
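The JAVA_HOME definition mentioned above is a single line in conf/hadoop-env.sh. The path shown here is a placeholder, not a value taken from this page; replace it with the root directory of your own Java 1.5 installation.

```shell
# conf/hadoop-env.sh
# Placeholder path (an assumption): point this at your own Java 1.5 install.
export JAVA_HOME=/usr/lib/jvm/java-1.5.0
```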
Input files are copied into the distributed filesystem as follows:
bin/hadoop dfs -put <localsrc> <dst>
For more details just type bin/hadoop dfs with no options.
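As a concrete instance of the -put command, the sketch below copies a local file or directory into the distributed filesystem and then lists the destination to confirm it arrived. The function name put_and_list and the HADOOP override variable are hypothetical; dfs -put and dfs -ls are the commands being wrapped.

```shell
# Hypothetical wrapper around the dfs commands. HADOOP defaults to bin/hadoop,
# so run this from the top of the Hadoop checkout.
put_and_list() {
  src="$1"
  dst="$2"
  # Copy the local source into the distributed filesystem,
  # then list the destination only if the copy succeeded.
  "${HADOOP:-bin/hadoop}" dfs -put "$src" "$dst" &&
    "${HADOOP:-bin/hadoop}" dfs -ls "$dst"
}
```

For example, put_and_list conf input would copy the local conf directory to input in the distributed filesystem; both names are just illustrative.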
Stage 3: Fully-distributed operation
Distributed operation is just like the pseudo-distributed operation described above, except:
Specify the hostname or IP address of the master server in the values for fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. These are specified as host:port pairs.
Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
Specify mapred.local.dir in conf/hadoop-site.xml. This determines where temporary MapReduce data is written. It also may be a list of directories.
Specify mapred.map.tasks and mapred.reduce.tasks in conf/mapred-default.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.
List all slave hostnames or IP addresses in your conf/slaves file, one per line.
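The rule of thumb for mapred.map.tasks and mapred.reduce.tasks is simple arithmetic on the total number of slave processors. The helper below is a sketch of that calculation only; the function name suggest_task_counts is made up for illustration, and the results are starting points rather than tuned values.

```shell
# Hypothetical helper: apply the 10x / 2x rule of thumb to a processor count.
suggest_task_counts() {
  procs="$1"
  # Accept only a positive integer without leading zeros.
  case "$procs" in
    ""|0*|*[!0-9]*) echo "usage: suggest_task_counts <slave processors>" >&2; return 1 ;;
  esac
  echo "mapred.map.tasks=$((procs * 10))"
  echo "mapred.reduce.tasks=$((procs * 2))"
}
```

For a cluster of 4 slaves with 2 processors each, suggest_task_counts 8 prints mapred.map.tasks=80 and mapred.reduce.tasks=16.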