= Apache Hadoop =
[[http://hadoop.apache.org/|Apache Hadoop]] is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named [[HadoopMapReduce|Map/Reduce]], where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system ([[DFS|HDFS]]) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

== General Information ==
 * [[http://hadoop.apache.org/|Official Apache Hadoop Website]]: download, bug-tracking, mailing-lists, etc.
 * [[ProjectDescription|Overview]] of Apache Hadoop
 * [[FAQ]]
 * [[HadoopIsNot|What Hadoop is not]]
 * [[Distribution|Distributions]] for Hadoop (RPMs, Debs, AMIs, etc)
 * [[HadoopPresentations|Presentations]], [[Books|books]], [[HadoopArticles|articles]] and [[Papers|papers]] about Hadoop
 * PoweredBy, a list of sites and applications powered by Apache Hadoop
 * Support
  * [[Help|Getting help from the hadoop community]].
  * [[Support|People and companies for hire]].
 * [[Conferences|Hadoop Community Events and Conferences]]
  * HadoopUserGroups (HUGs)
  * HadoopSummit
 * [[http://developer.yahoo.com/hadoop/tutorial/|Yahoo! Hadoop Tutorial]]: A thorough tutorial covering Hadoop setup, HDFS, and [[HadoopMapReduce|MapReduce]]
 * [[http://www.cloudera.com/hadoop-training-basic|Cloudera Online Hadoop Training]]: Video lectures, exercises and a pre-configured [[http://www.cloudera.com/hadoop-training-virtual-machine|virtual machine]] to follow along.
   Sessions cover [[http://www.cloudera.com/hadoop-training-programming-with-hadoop|Hadoop]], [[http://www.cloudera.com/hadoop-training-mapreduce-algorithms|MapReduce]], [[http://www.cloudera.com/hadoop-training-hive-introduction|Hive]], [[http://www.cloudera.com/hadoop-training-pig-introduction|Pig]] and more.

== User Documentation ==
 * ImportantConcepts
 * GettingStartedWithHadoop (lots of details and explanation)
 * QuickStart (for those who just want it to work ''now'')
 * [[http://hadoop.apache.org/core/docs/current/commands_manual.html|Command Line Options]] for the hadoop shell script.
 * [[HadoopOverview|Hadoop Code Overview]]
 * [[TroubleShooting|Troubleshooting]]: what to do when things go wrong
 * [[Setup|Setting up a Hadoop Cluster]]
  * [[Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)]] (tutorial on installing, configuring and running Hadoop on a single machine)
  * [[Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)]]
  * HowToConfigure Hadoop software
  * [[WebApp_URLs|WebApps for monitoring your system]]
  * [[NameNodeFailover|How to handle name node failure]]
  * [[GangliaMetrics|How to get metrics into Ganglia]]
  * [[LargeClusterTips|Tips for managing a large cluster]]
  * [[VirtualCluster|How to bring up a cluster of Virtual Machines]]
  * [[DiskSetup|Disk Setup: some suggestions]]
  * [[PerformanceTuning|Performance]]: getting extra throughput
  * [[http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html|Hadoop Windows/Eclipse Tutorial]]: Tutorial on how to set up and configure a Hadoop development cluster for Windows and Eclipse.
  * [[topology_rack_awareness_scripts|Topology Scripts / Rack Awareness]]
 * Map/Reduce
  * HadoopMapReduce
  * HadoopMapRedClasses
  * HowManyMapsAndReduces
  * TaskExecutionEnvironment
  * HowToDebugMapReducePrograms
 * Examples
  * WordCount
  * [[PythonWordCount|Python Word Count]]
  * [[C++WordCount|C/C++ Word Count]]
  * [[Grep]]
  * [[Sort]]
  * RandomWriter
  * [[HadoopDfsReadWriteExample|How to read from and write to HDFS]]
 * Amazon
  * Running Hadoop on [[AmazonEC2]]
  * Running Hadoop with AmazonS3
 * Benchmarks
  * [[HardwareBenchmarks|Hardware benchmarks]]
  * [[DataProcessingBenchmarks|Data processing benchmarks]]
 * Sub-Projects
  * [[Hbase]], a Bigtable-like structured storage system for Hadoop HDFS
  * [[http://wiki.apache.org/pig/|Apache Pig]], a high-level data-flow language and execution framework for parallel computation, built on top of Hadoop Core
  * [[Hive]], a data warehouse infrastructure that allows SQL-like ad hoc querying of data (in any format) stored in Hadoop
  * ZooKeeper, a high-performance coordination service for distributed applications
 * Contrib
  * HadoopStreaming (useful for using Hadoop with other programming languages)
  * DistributedLucene, a proposal for a distributed Lucene index in Hadoop
  * [[MountableHDFS]], Fuse-DFS and other tools to mount HDFS as a standard filesystem on Linux (and some other Unix OSs)
  * [[HDFS-APIs]] in Perl, Python, PHP, etc.
  * [[Chukwa]], a data collection, storage, and analysis framework

== Developer Documentation ==
 * [[Roadmap]], listing release plans.
 * HowToContribute
 * HowToDevelopUnitTests
 * HowToUseInjectionFramework
 * HowToSetupYourDevelopmentEnvironment
 * [[CodeReviewChecklist|HowToCodeReview]]
 * [[Jira]] usage guidelines
 * HowToCommit
 * HowToRelease
 * HudsonBuildServer
 * DevelopmentHints
 * ProjectSuggestions
 * [[HadoopUnderIDEA|Building/Testing under IntelliJ IDEA]]

== Related Resources ==
 * [[http://wiki.apache.org/nutch/NutchHadoopTutorial|Nutch Hadoop Tutorial]] (useful for understanding Hadoop in an application context)
 * [[http://www.alphaworks.ibm.com/tech/mapreducetools|IBM MapReduce Tools for Eclipse]] (an Eclipse plug-in that simplifies the creation and deployment of MapReduce programs)
 * The Hadoop IRC channel is #hadoop at irc.freenode.net.
 * [[http://www.tom-doehler.de/wordpress/index.php/2007/12/19/spring-and-hadoop/|Using Spring and Hadoop]] (a discussion of ways to use Hadoop and dependency injection with Spring)
 * [[http://wiki.apache.org/hama|Hama]], a Distributed Matrix Computational Package based on Hadoop Map/Reduce
 * [[http://heart.korea.ac.kr|Heart]], a Planet-Scale RDF Data Store and a Distributed Processing Engine
 * [[http://lucene.apache.org/mahout|Mahout]], scalable Machine Learning algorithms using Hadoop
 * [[http://opensolaris.org/os/project/livehadoop/|Live Hadoop]], a three-node, distributed Hadoop cluster running on an !OpenSolaris live CD
 * [[https://rc.usf.edu/trac/hadoop/wiki/SGEIntegration|SGE Integration]], a guide to tight integration of Hadoop with Sun Grid Engine

----
CategoryHomepage