Below are tips for managing large clusters.

Management Tool Support

  1. See the AmazonEC2 and AmazonS3 pages for tips on managing clusters built on EC2 and S3.
  2. See the Management Tools page for details on integration with management tools.
  3. Other good documentation: Patterns of Hadoop Deployment

Hadoop Configuration

NameNode Health

The NameNode is a single point of failure (SPOF). When it goes offline, the cluster goes down. If it loses its data, the filesystem is gone. Value it.

Workers

How to rebalance a full datanode

If a datanode is at or near 100% capacity:

  1. Decommission the node: this will copy everything off.
  2. Take it offline.
  3. Delete the data, clean up the HDDs.
  4. Add the node again.
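The bookkeeping around steps 1 and 4 can be sketched in a few lines of Python. This is only an illustration, under the assumption that decommissioning is driven by the exclude file named in the `dfs.hosts.exclude` property and that the NameNode is told to re-read it with `hdfs dfsadmin -refreshNodes` (older releases use `hadoop dfsadmin -refreshNodes`). The helper names, the file name `dfs.exclude`, and the hostname `worker17.example.com` are hypothetical; the sketch edits the file and prints the command rather than running it.

```python
import tempfile
from pathlib import Path

# Command that asks the NameNode to re-read its include/exclude files.
# Printed here, not executed, so the sketch is safe to run anywhere.
REFRESH_NODES_CMD = ["hdfs", "dfsadmin", "-refreshNodes"]


def read_hosts(exclude_file: Path) -> list[str]:
    """Return the non-empty host lines currently in the exclude file."""
    if not exclude_file.exists():
        return []
    lines = exclude_file.read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def write_hosts(exclude_file: Path, hosts: list[str]) -> None:
    """Rewrite the exclude file with one host per line."""
    exclude_file.write_text("".join(host + "\n" for host in hosts))


def exclude_host(exclude_file: Path, host: str) -> bool:
    """Step 1: list the host for decommissioning. True if the file changed."""
    hosts = read_hosts(exclude_file)
    if host in hosts:
        return False
    write_hosts(exclude_file, hosts + [host])
    return True


def readmit_host(exclude_file: Path, host: str) -> bool:
    """Step 4: drop the host from the exclude file. True if the file changed."""
    hosts = read_hosts(exclude_file)
    if host not in hosts:
        return False
    write_hosts(exclude_file, [h for h in hosts if h != host])
    return True


if __name__ == "__main__":
    # Hypothetical file and hostname, created in a scratch directory.
    exclude_file = Path(tempfile.mkdtemp()) / "dfs.exclude"
    if exclude_host(exclude_file, "worker17.example.com"):
        print("now run:", " ".join(REFRESH_NODES_CMD))
    # ... wait for decommissioning to finish, take the node offline, wipe disks ...
    if readmit_host(exclude_file, "worker17.example.com"):
        print("now run:", " ".join(REFRESH_NODES_CMD))
```

Both helpers are idempotent, so rerunning the procedure after an interruption does not duplicate or lose entries. The refresh command has to be issued after each change to the file, since editing the file alone has no effect on a running NameNode.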

Testing Failure

Things will go wrong. There is always a SPOF. Test your failure-handling processes before you go live.