Apache Hadoop
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
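The Map/Reduce paradigm described above can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions, not the Hadoop API: the function names (map_fragment, shuffle, reduce_counts, run_job) and the word-count task are hypothetical choices made for this example, and a real cluster would run the fragments on separate nodes rather than in a loop.

```python
from collections import defaultdict


def map_fragment(text):
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(fragments):
    """Run the whole job locally (hypothetical driver, not a Hadoop class)."""
    # Each fragment is mapped independently of the others, so any single call
    # could be run, or re-run after a failure, without affecting the rest.
    emitted = []
    for fragment in fragments:
        emitted.extend(map_fragment(fragment))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog", "the fox"]))
```

Because map_fragment depends only on its own input, repeating it for a fragment yields the same pairs, which is the property that lets a framework reschedule failed work on another node.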
General Information
Official Apache Hadoop Website: download, bug-tracking, mailing-lists, etc.
Overview of Apache Hadoop
Distributions for Hadoop (RPMs, Debs, AMIs, etc.)
Presentations, books, articles and papers about Hadoop
PoweredBy, a list of sites and applications powered by Apache Hadoop
- Support
Hadoop Community Events and Conferences
HadoopUserGroups (HUGs)
Yahoo! Hadoop Tutorial: A thorough tutorial covering Hadoop setup, HDFS, and MapReduce
Cloudera Online Hadoop Training: Video lectures, exercises and a pre-configured virtual machine to follow along. Sessions cover Hadoop, MapReduce, Hive, Pig and more.
User Documentation
GettingStartedWithHadoop (lots of details and explanation)
QuickStart (for those who just want it to work now)
Command-line options for the hadoop shell script
Troubleshooting: what to do when things go wrong
Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) (tutorial on installing, configuring and running Hadoop on a single machine)
HowToConfigure Hadoop software
Performance: getting extra throughput
Hadoop Windows/Eclipse Tutorial: how to set up and configure a Hadoop development cluster for Windows and Eclipse
- Map/Reduce
- Examples
- Amazon
- Benchmarks
- Sub-Projects
HBase, a Bigtable-like structured storage system for Hadoop HDFS
Apache Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
Hive, a data warehouse infrastructure that allows SQL-like ad hoc querying of data (in any format) stored in Hadoop
ZooKeeper is a high-performance coordination service for distributed applications.
- Contrib
HadoopStreaming (Useful for using Hadoop with other programming languages)
DistributedLucene, a proposal for a distributed Lucene index in Hadoop
MountableHDFS, Fuse-DFS and other tools to mount HDFS as a standard filesystem on Linux (and some other Unix OSs)
HDFS APIs in Perl, Python, PHP, etc.
Chukwa, a data collection, storage, and analysis framework
Developer Documentation
Related Resources
Nutch Hadoop Tutorial (Useful for understanding Hadoop in an application context)
IBM MapReduce Tools for Eclipse (An Eclipse plug-in that simplifies the creation and deployment of MapReduce programs)
Hadoop IRC channel: #hadoop at irc.freenode.net
Using Spring and Hadoop (Discussion of possibilities to use Hadoop and Dependency Injection with Spring)
Hama, a Distributed Matrix Computational Package based on Hadoop Map/Reduce
Heart, a Planet-Scale RDF Data Store and a Distributed Processing Engine
Mahout, scalable Machine Learning algorithms using Hadoop
Live Hadoop: a three-node, distributed Hadoop cluster running on an OpenSolaris live CD
SGE Integration: a guide to tight integration of Hadoop with Sun Grid Engine