Running Hadoop on Amazon EC2

[http://www.amazon.com/gp/browse.html?node=201590011 Amazon EC2] (Elastic Compute Cloud) is a computing service. One allocates a set of hosts, runs one's application on them, and then, when done, de-allocates the hosts. Billing is hourly per host. EC2 thus permits one to deploy Hadoop on a cluster without owning and operating that cluster, instead renting it by the hour.

This document assumes that you have already followed the steps in [http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-06-26/ Amazon's Getting Started Guide].

Conventions

In this document, command lines that start with '#' are executed on an Amazon EC2 instance, while command lines that start with '%' are executed on your workstation.

Building an Image

To use Hadoop it is easiest to build an image with most of the software you require already installed. Amazon provides [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=354&categoryID=87 good documentation] for building images. Follow the "Getting Started" guide there to learn how to install the EC2 command-line tools, etc.
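
For reference, the EC2 command-line tools are typically configured on your workstation with environment variables along these lines (a sketch only; the paths and file names are placeholders for your own installation and credentials):

% export EC2_HOME=/path/to/ec2-api-tools
% export EC2_PRIVATE_KEY=/path/to/pk-XXXXXXXX.pem
% export EC2_CERT=/path/to/cert-XXXXXXXX.pem
% export PATH=$PATH:$EC2_HOME/bin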

To build an image for Hadoop:

  1. Run an instance of the Fedora base image.
  2. Log in to this instance (using ssh).
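
    For example (a sketch only; the AMI id of the Fedora base image and the gsg-keypair key file follow Amazon's Getting Started Guide and must be replaced with your own values):

    % ec2-run-instances ami-XXXXXXXX -k gsg-keypair
    % ec2-describe-instances
    % ssh -i id_rsa-gsg-keypair root@<instance-hostname>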
  3. Install [http://java.sun.com/javase/downloads/index.jsp Java]. Copy the link for the "Linux self-extracting file" from Sun's download page, then use wget to retrieve the JDK. Unpack it in a well-known location, such as /usr/local.

    # cd /usr/local
    # wget -O java.bin http://.../jdk-1_5_0_09-linux-i586.bin
    # sh java.bin
    # rm java.bin
  4. Install rsync.

    # yum install rsync
  5. (Optional) Install any other tools you might need.

    # yum install emacs
    # yum install subversion

    To install [http://ant.apache.org/ Ant], copy the download URL from the website, and then:

    # cd /usr/local
    # wget http://.../apache-ant-1.6.5-bin.tar.gz
    # tar xzf apache-ant-1.6.5-bin.tar.gz
    # rm apache-ant-1.6.5-bin.tar.gz
  6. Install Hadoop.

    # cd /usr/local
    # wget http://.../hadoop-X.X.X.tar.gz
    # tar xzf hadoop-X.X.X.tar.gz
    # rm hadoop-X.X.X.tar.gz
  7. Configure Hadoop (described below).
  8. Edit shell configuration files to, e.g., add the Java, Ant and Hadoop executables to your PATH, and make any other changes that will make the system easy for you to use. You may also wish to add a non-root user account to run Hadoop's daemons; a sketch follows.
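
    For example (a sketch only; the version numbers, paths, and user name are assumptions that must match your installation):

    # echo 'export JAVA_HOME=/usr/local/jdk1.5.0_09' >> /root/.bashrc
    # echo 'export PATH=$PATH:$JAVA_HOME/bin:/usr/local/apache-ant-1.6.5/bin:/usr/local/hadoop-X.X.X/bin' >> /root/.bashrc
    # useradd hadoop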
  9. Start Hadoop, and test your configuration minimally.

    # cd /usr/local/hadoop-X.X.X
    # bin/start-all.sh
    # bin/hadoop jar hadoop-X.X.X-examples.jar pi 10 10000000
    # bin/hadoop dfs -ls
    # bin/stop-all.sh
  10. Create a new image, using Amazon's instructions (bundle, upload & register).
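
    The exact commands are covered by Amazon's documentation; roughly, and with the key/certificate paths, account id, access keys, and S3 bucket name below being placeholders you must substitute:

    # ec2-bundle-vol -d /mnt -k /mnt/pk-XXXXXXXX.pem -c /mnt/cert-XXXXXXXX.pem -u XXXXXXXXXXX
    # ec2-upload-bundle -b my-hadoop-images -m /mnt/image.manifest.xml -a <access-key-id> -s <secret-access-key>
    % ec2-register my-hadoop-images/image.manifest.xml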

Configuring Hadoop

Hadoop is configured with a single master node and many slave nodes. To facilitate re-deployment without re-configuration, one may register a name in DNS for the master host, then reset the address for this name each time the cluster is re-deployed. Services such as [http://www.dyndns.com/services/dns/dyndns/ DynDNS] make this fairly easy. In the following, we refer to the master as master.mydomain.com; please replace this with your actual master node's name.
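
As a sketch of what resetting the address might look like using DynDNS's standard update interface (the account name, password, and IP address below are placeholders; consult DynDNS's documentation for the exact details):

% wget -q -O - 'https://myuser:mypassword@members.dyndns.org/nic/update?hostname=master.mydomain.com&myip=1.2.3.4'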

In EC2, the local data volume is mounted as /mnt.

hadoop-env.sh

Specify the JVM location, the log directory, and that rsync should be used to update the Hadoop installation on slaves from the master.

# Set Hadoop-specific environment variables here.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/local/jdk1.5.0_09

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/mnt/hadoop/logs

# host:path where hadoop code should be rsync'd from.  Unset by default.
export HADOOP_MASTER=master.mydomain.com:/usr/local/hadoop-X.X.X

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
export HADOOP_SLAVE_SLEEP=1

You must also create the log directory on each instance.

# mkdir -p /mnt/hadoop/logs
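
The directory lives on the instance-local /mnt volume, so it must exist on every node. Once the cluster is up and conf/slaves has been populated (see "Launching your cluster" below), one way to create it on all slaves at once is Hadoop's bin/slaves.sh helper, which runs a command on every host listed in conf/slaves:

# /usr/local/hadoop-X.X.X/bin/slaves.sh mkdir -p /mnt/hadoop/logs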

hadoop-site.xml

All of Hadoop's local data is stored relative to hadoop.tmp.dir, so we need only specify this, plus the name of the master node for DFS (the NameNode) and for MapReduce (the JobTracker).

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>master.mydomain.com:50001</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master.mydomain.com:50002</value>
</property>

</configuration>

mapred-default.xml

These values should vary with the size of your cluster. Typically mapred.map.tasks should be 10x the number of worker instances, and mapred.reduce.tasks should be 3x the number of worker instances. The following is thus configured for 19 workers (the 20-instance cluster launched below, less the master).

<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>190</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>57</value>
</property>

</configuration>
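
If you redeploy with a different number of instances, these values must be recomputed. A minimal sketch that regenerates the file, run on the master (the WORKERS variable and the conf path are assumptions to adjust for your installation):

# regenerate conf/mapred-default.xml for a cluster of WORKERS worker instances
WORKERS=19
MAPS=`expr $WORKERS \* 10`
REDUCES=`expr $WORKERS \* 3`
cat > /usr/local/hadoop-X.X.X/conf/mapred-default.xml <<EOF
<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>$MAPS</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>$REDUCES</value>
</property>

</configuration>
EOF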

Security

To access your cluster you must open at least port 22, for ssh. It is generally also useful to open a few other ports, to view job progress: port 50030 serves the JobTracker's web interface, which shows job status, and port 50060 serves each TaskTracker's web interface, which is useful for more detailed debugging.

% ec2-add-group my-group -d "Group for Hadoop"
% ec2-authorize my-group -p 22
% ec2-authorize my-group -p 50030
% ec2-authorize my-group -p 50060

Instances within the cluster should have unfettered access to one another. This is enabled by the following, replacing XXXXXXXXXXX with your Amazon Web Services user id.

% ec2-authorize my-group -o my-group -u XXXXXXXXXXX

Launching your cluster

Start by allocating instances of your image. Use ec2-describe-images to find your image id, denoted ami-XXXXXXXX below.

To run a 20-node cluster:

% ec2-describe-images
% ec2-run-instances ami-XXXXXXXX -k gsg-keypair -g my-group -n 20

Wait a few minutes for the instances to launch.

% ec2-describe-instances
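
If you prefer to wait programmatically, a minimal sketch (assuming instances that have not yet started are reported with a 'pending' state in the INSTANCE lines):

% while ec2-describe-instances | grep INSTANCE | grep -q pending; do sleep 10; done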

Once the instances have launched, point your master host name (master.mydomain.com) in DNS at the first host listed; one way to extract that hostname is sketched below.
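
For example, the master's public hostname can be extracted from the same field used to build the slaves file below:

% MASTER_HOST=`ec2-describe-instances | grep INSTANCE | cut -f 4 | head -1`
% echo $MASTER_HOST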

Create a slaves file containing the rest of the instances and copy it to the master.

% ec2-describe-instances | grep INSTANCE | cut -f 4 | tail -n +2 > slaves
% scp slaves master.mydomain.com:/usr/local/hadoop-X.X.X/conf/slaves

Format the new cluster's filesystem.

% ssh master.mydomain.com
# /usr/local/hadoop-X.X.X/bin/hadoop namenode -format

Start the cluster.

# /usr/local/hadoop-X.X.X/bin/start-all.sh

Visit your cluster's web UI at http://master.mydomain.com:50030/.

Shutting down your cluster

% ec2-terminate-instances `ec2-describe-instances | grep INSTANCE | cut -f 2`

Future Work

Ideally Hadoop could directly access job data from [http://www.amazon.com/gp/browse.html?node=16427261 Amazon S3] (Simple Storage Service). Initial input could be read from S3 when a cluster is launched, and the final output could be written back to S3 before the cluster is decommissioned. Intermediate, temporary data, needed only between MapReduce passes, would be more efficiently stored in Hadoop's DFS. This would require an implementation of a Hadoop [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/fs/FileSystem.html FileSystem] for S3. There are two issues in Hadoop's bug database (Jira) related to this.

Please vote for these issues in Jira if you feel this would help your project. (Anyone can create a Jira account in order to vote on issues.)