Get up and running fast
The fastest way may be to just install a pre-configured virtual Hadoop environment. Two such environments are:
The Cloudera Training Virtual Machine. This image runs within the free VMware Player and has Hadoop, Hive, Pig and examples pre-loaded. Video lectures and screencasts walk you through everything.
The OpenSolaris Hadoop Live CD. This virtual Hadoop cluster runs entirely off the CD, and does not require you to install any new software on your system.
Cloudera also provides their distribution for Hadoop (Apache 2.0 Licensed), including support for Hive and Pig and configuration management, in the following formats:
RPMs for Red Hat-based systems (CentOS, Fedora, RHEL, etc.)
Debian packages for Debian-based systems (Debian, Ubuntu, etc.)
If you want to work exclusively with Hadoop code directly from Apache, the rest of this document can help you get started quickly from there.
This guide is based on the docs found at the following link, but modified to work with the current distribution: http://hadoop.apache.org/core/api/overview-summary.html#overview_description
Please note this was last updated to match svn version 605291. Things may have changed since then. If they have, please update this page.
Requirements
- Java 1.5.X
- ssh and sshd
- rsync
Preparatory Steps
Download
Release Versions: can be found here http://hadoop.apache.org/core/releases.html
Subversion: First check that the current build isn't broken: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/
Then grab the latest with Subversion:
svn co http://svn.apache.org/repos/asf/hadoop/core/trunk hadoop
Run the following commands:
cd hadoop
ant
ant examples
bin/hadoop
bin/hadoop should display the basic command-line help and let you know it's at least basically working. If any of the above steps failed, use Subversion to roll back to an earlier day's revision.
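Rolling back to an earlier day can be done with Subversion's date-based revision syntax. The following is a minimal sketch, assuming it is run from inside the hadoop checkout; the function name roll_back_to_date and the SVN_BIN variable are illustrative names, not part of Hadoop or Subversion.

```shell
# Illustrative helper; SVN_BIN and roll_back_to_date are made-up names.
SVN_BIN="${SVN_BIN:-svn}"

# Update the working copy to the revision that was current on a given date.
# Usage: roll_back_to_date YYYY-MM-DD
roll_back_to_date() {
    if [ -z "$1" ]; then
        echo "usage: roll_back_to_date YYYY-MM-DD" >&2
        return 2
    fi
    # svn accepts a date in braces as a revision specifier.
    "$SVN_BIN" update -r "{$1}"
}
```

After rolling back (for example, roll_back_to_date with yesterday's date), run ant and ant examples again to rebuild from that day's source.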
Stage 1: Standalone Operation
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
cp conf/*.xml input
bin/hadoop jar build/hadoop-0.16.0-dev-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
The version number on the jar may have changed by the time you read this. You should see a lot of INFO-level log messages go by when you run it, and cat output/* should give you something that looks like this:
cat output/*
2 dfs.
1 dfs.block.size
1 dfs.blockreport.interval
1 dfs.client.block.write.retries
1 dfs.client.buffer.dir
1 dfs.data.dir
1 dfs.datanode.bind
1 dfs.datanode.dns.interface
1 dfs.datanode.dns.nameserver
1 dfs.datanode.du.pct
1 dfs.datanode.du.reserved
1 dfs.datanode.port
...(and so on)
If you saw the error Exception in thread "main" java.lang.NoClassDefFoundError: build/hadoop-0/16/0-dev-examples/jar, it means you forgot to type jar after bin/hadoop.
If you were unable to run this example, roll back to a previous night's version. If it seemed to run fine but cat didn't print anything, you probably mistyped something. Try copying the command directly from the wiki to avoid typos. You'll need to wipe out the output directory between each run.
Congratulations, you have just successfully run your first MapReduce job with Hadoop.
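Because the output directory has to be wiped between runs, it can help to wrap the standalone example in a small function. This is only a sketch: HADOOP_BIN, EXAMPLES_JAR and rerun_grep_example are illustrative names, and the default jar name is the one used above, which may differ in your build.

```shell
# Illustrative names: HADOOP_BIN, EXAMPLES_JAR and rerun_grep_example are
# not defined by Hadoop. Run from the top of the hadoop checkout.
HADOOP_BIN="${HADOOP_BIN:-bin/hadoop}"
# The jar name depends on the version you built; adjust as needed.
EXAMPLES_JAR="${EXAMPLES_JAR:-build/hadoop-0.16.0-dev-examples.jar}"

rerun_grep_example() {
    # Remove the results of any earlier run.
    rm -rf output
    # Refresh the input directory with the sample XML files.
    mkdir -p input
    cp conf/*.xml input/ || return 1
    # Run the example job, then show what it wrote.
    "$HADOOP_BIN" jar "$EXAMPLES_JAR" grep input output 'dfs[a-z.]+' || return 1
    cat output/*
}
```

Calling rerun_grep_example repeatedly should then give the same listing each time without a manual cleanup step.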
Stage 2: Pseudo-distributed Configuration
You can in fact run everything on a single host. To run things this way, put the following in conf/hadoop-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<!-- set to 1 to reduce warnings when
running on a single node -->
</property>
</configuration>
Now check that the command ssh localhost does not require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now try ssh localhost again. If this doesn't work, you're going to have to figure out what's going on with your ssh-agent on your own.
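The password check can also be done non-interactively. The sketch below relies on ssh's BatchMode option, which disables prompting; SSH_BIN, can_ssh_without_password and report_ssh_status are illustrative names. Note that the check may also fail for reasons other than a password, for example if localhost's host key has not been accepted yet.

```shell
# Illustrative names: SSH_BIN and both functions are made up here.
SSH_BIN="${SSH_BIN:-ssh}"

# Succeeds only if ssh can log in to the host without asking for anything.
can_ssh_without_password() {
    host="${1:-localhost}"
    # BatchMode=yes tells ssh never to prompt, so a login that would need
    # a password fails instead of waiting for input.
    "$SSH_BIN" -o BatchMode=yes "$host" true
}

# Print a one-line verdict and return non-zero if the login failed.
report_ssh_status() {
    host="${1:-localhost}"
    if can_ssh_without_password "$host"; then
        echo "passwordless ssh to $host works"
    else
        echo "ssh to $host still needs a password or other setup"
        return 1
    fi
}
```

Running report_ssh_status with no argument checks localhost; pass a host name to check another machine.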
Windows Users: To start the ssh server, you need to run "ssh-host-config -y" in the Cygwin environment. If it asks for the CYGWIN environment value, set it to "ntsec tty". Afterwards you can start the server from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".
Mac Users: You'll probably need to install something like SSHKeychain or SSHChain (no idea which is better) to be able to ssh to a computer without having to enter the password every time. This is because ssh-agent was designed for X11 systems and OS X isn't an X11 system.
Bootstrapping
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
You should see a quick series of STARTUP_MSGs and a SHUTDOWN_MSG.
Open the conf/hadoop-env.sh file and define JAVA_HOME in it. Then start up the Hadoop daemons with:
bin/start-all.sh
It should notify you that it's starting the namenode, datanode, secondarynamenode, jobtracker, and tasktracker.
Input files are copied into the distributed filesystem as follows:
bin/hadoop dfs -put <localsrc> <dst>
For more details, just type bin/hadoop dfs with no options.
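As a small illustration of the copy step, the sketch below puts a local path into the distributed filesystem and then lists the destination to confirm it arrived. It assumes the daemons are already running; HADOOP_BIN and put_and_list are illustrative names.

```shell
# Illustrative names: HADOOP_BIN and put_and_list are not part of Hadoop.
HADOOP_BIN="${HADOOP_BIN:-bin/hadoop}"

# Copy a local file or directory into the distributed filesystem and
# list the destination afterwards.
# Usage: put_and_list <localsrc> <dst>
put_and_list() {
    if [ "$#" -ne 2 ]; then
        echo "usage: put_and_list <localsrc> <dst>" >&2
        return 2
    fi
    "$HADOOP_BIN" dfs -put "$1" "$2" || return 1
    "$HADOOP_BIN" dfs -ls "$2"
}
```

For example, put_and_list input input would copy the local input directory from Stage 1 into the distributed filesystem under the same name.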
Stage 3: Fully-distributed operation
Distributed operation is just like the pseudo-distributed operation described above, except:
- Specify the hostname or IP address of the master server in the values for fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. These are specified as host:port pairs.
- Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
- Specify mapred.local.dir in conf/hadoop-site.xml. This determines where temporary MapReduce data is written. It also may be a list of directories.
- Specify mapred.map.tasks and mapred.reduce.tasks in conf/mapred-default.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.
- List all slave hostnames or IP addresses in your conf/slaves file, one per line.
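The settings listed above can be sketched as a few shell helpers. Everything here is illustrative: the function names, the host names used in the usage note, and the /data/... directories are placeholders, and ports 9000 and 9001 are simply carried over from the pseudo-distributed example.

```shell
# All function names and the /data/... paths are placeholders chosen
# for illustration; substitute your own values.

# Print a hadoop-site.xml for a cluster whose master is $1.
# Ports 9000 and 9001 follow the pseudo-distributed example.
print_site_config() {
    master="$1"
    cat <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>${master}:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${master}:9001</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/hadoop/data,/data/2/hadoop/data</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/hadoop/mapred,/data/2/hadoop/mapred</value>
  </property>
</configuration>
EOF
}

# Apply the rule of thumb: 10x and 2x the number of slave processors.
suggest_task_counts() {
    slave_processors="$1"
    echo "mapred.map.tasks=$((slave_processors * 10))"
    echo "mapred.reduce.tasks=$((slave_processors * 2))"
}

# Write the given host names to a slaves file, one per line.
# Usage: write_slaves_file <file> <host>...
write_slaves_file() {
    slaves_file="$1"
    shift
    printf '%s\n' "$@" > "$slaves_file"
}
```

For instance, print_site_config master.example.com > conf/hadoop-site.xml writes the site file, write_slaves_file conf/slaves slave1 slave2 writes the slaves list, and suggest_task_counts 8 prints mapred.map.tasks=80 and mapred.reduce.tasks=16 for a cluster with eight slave processors in total.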