
Get up and running fast

The fastest way may be to just install a pre-configured virtual Hadoop environment. Two such environments are:

 * The [[http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html|Cloudera QuickStart Virtual Machine]]. This image runs within the free VMware Player, VirtualBox, or KVM and has Hadoop, Hive, Pig, and examples pre-loaded. Video lectures and screencasts walk you through everything.

 * The [[http://hortonworks.com/products/hortonworks-sandbox/|Hortonworks Sandbox]]. The sandbox is a pre-configured virtual machine that comes with a dozen interactive Hadoop tutorials.

Cloudera also provides its [[http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html|distribution for Hadoop]] (Apache 2.0 licensed), including support for Hive and Pig and configuration management, for [[http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH-Version-and-Packaging-Information/cdhvd_topic_2.html|various operating systems]].

If you want to work exclusively with Hadoop code directly from Apache, the following articles from the website will be most useful:

 * [[http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleNodeSetup.html|Single-Node Setup]]
 * [[http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html|Cluster Setup]]

A note on the above Apache guides: if you're having trouble getting "ssh localhost" to work, the following OS-specific tips may help.

Windows users: to start the ssh server, run "ssh-host-config -y" in a Cygwin environment. If it asks for the value of the CYGWIN environment variable, set it to "ntsec tty". Afterwards you can start the server either from Cygwin with "cygrunsrv --start sshd" or from the Windows command line with "net start sshd".
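
Put together, the whole sequence from a Cygwin shell might look like the sketch below (it assumes Cygwin's openssh package is already installed and that the shell has Administrator rights):

{{{
# Configure sshd as a Windows service; -y answers "yes" to the prompts.
ssh-host-config -y

# If prompted for the value of the CYGWIN environment variable, enter:
#   ntsec tty

# Start the service from Cygwin ...
cygrunsrv --start sshd

# ... or, equivalently, from a Windows command prompt:
#   net start sshd

# Verify that the login the Hadoop guides rely on now works:
ssh localhost
}}}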

Mac users: in recent versions of OS X, ssh-agent is already set up with launchd and Keychain. You can verify this by executing "echo $SSH_AUTH_SOCK" in your favorite shell. Use "ssh-add -k" and "ssh-add -K" to add your keys and passphrases to your keychain.
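
For example, a quick check and key setup might look like this (a sketch; the key path ~/.ssh/id_rsa is an assumption, so substitute your own key file):

{{{
# Should print a socket path, confirming launchd has started ssh-agent:
echo $SSH_AUTH_SOCK

# Load a key into the agent; on OS X, -K also stores the passphrase in
# your keychain so you are not prompted for it again.
# (~/.ssh/id_rsa is an assumed path -- use your actual key.)
ssh-add -K ~/.ssh/id_rsa

# Verify passwordless login:
ssh localhost
}}}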

Multi-node cluster setup is largely similar to single-node (pseudo-distributed) setup, except for the following (a configuration sketch follows the list):

 1. Set the hostname or IP address of your master server in the value of fs.default.name, e.g. hdfs://master.example.com/, in conf/core-site.xml.
 2. Set the host and port of your master server in the value of mapred.job.tracker, e.g. master.example.com:port, in conf/mapred-site.xml.
 3. Set directories for dfs.name.dir and dfs.data.dir in conf/hdfs-site.xml. These are local directories used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple local devices.
 4. Set mapred.local.dir in conf/mapred-site.xml, the local directory where temporary MapReduce data is stored. It may also be a list of directories.
 5. Set mapred.map.tasks and mapred.reduce.tasks in conf/mapred-site.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks and 2x the number of slave processors for mapred.reduce.tasks.
 6. Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
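
To make the list concrete, minimal versions of the three files might look like the sketch below. Here master.example.com is the placeholder hostname from the list, while the port 9001 and all directory paths are illustrative assumptions, not defaults to copy:

{{{
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master.example.com/</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>                <!-- used on the master node -->
    <value>/srv/hadoop/name</value>          <!-- placeholder path -->
  </property>
  <property>
    <name>dfs.data.dir</name>                <!-- used on the slave nodes -->
    <!-- may be a space- or comma-separated list of local directories -->
    <value>/srv/hadoop/data1,/srv/hadoop/data2</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>   <!-- 9001 is an example port -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/srv/hadoop/mapred-local</value>  <!-- placeholder; may be a list -->
  </property>
</configuration>
}}}

Then, per the last step, format the filesystem and start the daemons from the master node (assuming the Hadoop 1.x tarball layout):

{{{
bin/hadoop namenode -format   # one-time: format the distributed filesystem
bin/start-all.sh              # start the HDFS and MapReduce daemons
}}}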

See [[http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html#Configuring_the_Hadoop_Daemons_in_Non-Secure_Mode|Hadoop Cluster Setup/Configuration]] for details.
