Below are tips for managing large clusters.
- Have a good sysadmin if you're not one yourself.
Take a look at a presentation done by Allen Wittenauer: http://tinyurl.com/5foamm
- Have the LAN closed off to untrusted users. Without this, your filesystem is effectively open to everyone on the network.
- Once you are on the private LAN, turn off all firewalls on the machines, as it only creates connectivity problems.
- Use LDAP or similar to manage user accounts.
- Only put the slaves file on your namenode and secondary namenode to prevent confusion.
Use RPMs to install the Hadoop binaries. Cloudera provide some RPMs for this, and a web site to generate configuration RPM files.
- Use kickstart or similar to bring up the machines.
- If you are trying to configure the machines one by one, step away from the keyboard. That is not the way to manage a cluster.
- Keep an eye out for disk SMART messages in the server logs. They warn of trouble.
- Keep an eye on disk capacity, especially on the namenode. You do not want the NN to run out of storage as Bad Things happen.
- Keep the underlying software in sync: OS, Java version.
- Run the rebalancer, throttled back appropriately for your bandwidth
Management Tool Support
See the Management Tools Page for details on integration with Management Tools
Other good documentation: Patterns of Hadoop Deployment
- Don't do it by copying XML files around by hand.
- Look at the cloudera config tools. If you use them, keep the previous RPMs around
Consider a system configuration management package to keep Hadoop's source and configuration consistent across all nodes. Some example packages are bcfg2, SmartFrog, Puppet, cfengine, etc.
- Keep your site XML files under SCM, so you can roll-back, diff changes.
The NameNode is a SPOF. When it goes offline, the cluster goes down. If it loses its data, the filesystem is gone. Value it.
- Never let its disks fill up.
- Consider RAID storage here. If not, set it to save its data to two independent drives, ideally on separate controllers (just in case the controller decides to play up)
- Set the NN up to save one copy of all its data to a remote machine (NFS?), so even if the NN goes down, you can bring up a new machine with the same hostname for everything else to bind to.
- Have identical hardware on all workers in the cluster, eliminating the need to have different configuration options (task slots, data directory locations, etc.)
- Have a common user account on every machine you run Hadoop on, with a common public key in ~/.ssh/authorized_keys
- Track HDDs, their history and their failures. Disk failures are not always independent in a large datacentre.
- Have simple hostname to rack or IP to rack mappings, so the rack detection scripts are trivial.
How to rebalance a full datanode
If a datanode is at or near 100% capacity,
- Decommission the node: this will copy everything off.
- Take it offline.
- Delete the data, clean up the HDDs.
- Add the node again.
Things will go wrong. There is always SPOF. Test your failure handling processes before you go live.
- Simulate a corrupted edit log by killing the namenode process, truncating the (binary) edit log, and bringing it up. See how the team handles it.
- Turn off one of the switches, pull out a network cable. See how the cluster handles it, how it recovers. Then put the switch back on.
- Turn an entire rack off without warning. See what happens when they go offline.
- Turn off DNS. Or just the rDNS side of things.
- Turn off the entire datacenter, switch it back on. Are there any race conditions?
- Write an job that tries to generate too much data, fills up the cluster. How is it handled?