Differences between revisions 308 and 309
Revision 308 as of 2016-01-20 05:45:15
Size: 9272
Editor: ArpitAgarwal
Comment: Link to cluster setup instructions in latest 2.x stable release docs
Revision 309 as of 2016-01-20 06:13:12
Size: 9299
Editor: ArpitAgarwal
Comment: Link to 2.x stable release docs for NameNode HA, remove couple of outdated links.
Line 41:
-  * [[NameNodeFailover|How to handle name node failure]]
+  * [[https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html|Configure NameNode High-Availability]]
Line 45:
-  * [[PerformanceTuning|Performance:]] getting extra throughput

Apache Hadoop

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
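
The idea that a fragment of work "may be executed or re-executed on any node" can be illustrated with a small, single-process sketch. This is not Hadoop code: `run_with_retries`, `run_job`, and `flaky_square` are hypothetical names invented for this example, and the simulated failure stands in for a node dying partway through a task. The sketch assumes each fragment is processed by a side-effect-free function, which is what makes re-execution safe.

```python
def run_with_retries(task, fragment, max_attempts=3):
    """Run one fragment of work, re-executing it if an attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(fragment)
        except RuntimeError:
            # Give up only after the last allowed attempt.
            if attempt == max_attempts:
                raise


def run_job(task, fragments):
    """Apply the task to every fragment independently and collect results."""
    return [run_with_retries(task, fragment) for fragment in fragments]


# Hypothetical task: fails once for every even input to simulate a node failure.
failed_once = set()


def flaky_square(n):
    if n % 2 == 0 and n not in failed_once:
        failed_once.add(n)
        raise RuntimeError("simulated node failure")
    return n * n


results = run_job(flaky_square, [1, 2, 3, 4])
```

Because each fragment depends only on its own input, a failed attempt can simply be repeated without affecting the other fragments; the job result is the same as if no failure had occurred.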

General Information

  • HBase, a Bigtable-like structured storage system for Hadoop HDFS

  • Apache Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.

  • Hive, a data warehouse infrastructure that allows SQL-like ad hoc querying of data (in any format) stored in Hadoop

  • ZooKeeper is a high-performance coordination service for distributed applications.

  • Hama, a distributed computing framework modeled on Google's Pregel and based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

  • Mahout, scalable machine learning algorithms using Hadoop

  • Hadoop Compatible FileSystems (HCFS)

  • Apache Gora, an open source framework that provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

User Documentation

Setting up a Hadoop Cluster



MapReduce is the foundational programming model of Hadoop and is critical to understand.
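
As a rough aid to understanding, the three stages of a MapReduce job (map, shuffle, reduce) can be sketched in plain Python with the classic word-count example. This is a single-process illustration only, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this sketch, and each input string stands in for one input split that Hadoop would hand to a separate map task.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    for document in documents:  # each document stands in for one input split
        pairs.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; the sketch keeps only the data flow between the stages.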

Contributed parts of the Hadoop codebase

  • These are independent modules that are in the Hadoop codebase but not yet tightly integrated with the main project.

Developer Documentation


FrontPage (last edited 2016-01-20 06:13:12 by ArpitAgarwal)