Nov 2009 Testing Framework Conference Call

Some of the people on the Hadoop list are organising a quick conference call on the topic of testing; this wiki page accompanies it.

JIRA Issues

Use Cases

Here are some of the use cases that come up when you consider testing Hadoop.

Benchmarking

One use case that comes up is stress testing clusters: seeing whether a cluster supports Hadoop "as well as it should", and finding out why not if it is inadequate. What we have today is TeraSort, where you have to guess the approximate numbers and then run the job. TeraSort creates its own test data, which is good, but it doesn't stress the CPUs as realistically as many workloads, and it generates lots of intermediate and final data; there is no reduction in data volume between input and output.
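As a rough illustration, the 0.20-era TeraGen and TeraSort example classes can be driven programmatically via ToolRunner; this is only a sketch, and the row count and HDFS paths are illustrative rather than recommendations.

    // Sketch only: drives the stock TeraGen/TeraSort example classes via ToolRunner.
    // Row count and paths are illustrative, not recommendations.
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraSortBenchmark {
      public static void main(String[] args) throws Exception {
        // Generate 10^9 rows of 100 bytes each (roughly 100 GB) into HDFS
        ToolRunner.run(new JobConf(), new TeraGen(),
            new String[] {"1000000000", "/benchmarks/terasort-input"});
        // Sort it; the output is the same volume as the input, so plan disk space accordingly
        ToolRunner.run(new JobConf(), new TeraSort(),
            new String[] {"/benchmarks/terasort-input", "/benchmarks/terasort-output"});
      }
    }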

Basic Cluster Health Tests

There are currently no tests that exercise Hadoop via its web pages, nor any job submission and monitoring through them. It is in fact possible to bring up a Hadoop cluster in which JSP doesn't work, yet the basic tests all appear fine (even TeraSort), provided you use the low-level APIs.
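A minimal sketch of such a health check, assuming the 0.20-era default web UI ports and placeholder hostnames: fetch the NameNode and JobTracker JSP pages and fail if they don't return HTTP 200.

    // Sketch of a web UI health check; hostnames are placeholders, ports are 0.20-era defaults.
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebUiHealthCheck {
      private static final String[] PAGES = {
        "http://namenode:50070/dfshealth.jsp",   // NameNode status page
        "http://jobtracker:50030/jobtracker.jsp" // JobTracker status page
      };

      public static void main(String[] args) throws Exception {
        for (String page : PAGES) {
          HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
          conn.setConnectTimeout(10000);
          int status = conn.getResponseCode();
          System.out.println(page + " -> HTTP " + status);
          if (status != 200) {
            throw new IllegalStateException("Web UI check failed: " + page);
          }
        }
      }
    }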

Proposals:

Testing underlying platforms

We need to test the underlying platforms, from the JVM and Linux distributions to any Infrastructure-on-Demand APIs that provide VMs on which Hadoop can run.

JVM Testing

This is an IBM need; it can also be used to qualify new Sun JVM releases. Any JVM defect which stops Hadoop running at scale should be viewed as a blocking issue by all JVM suppliers.

OS Testing

Test that Hadoop works on the target OS. If Hadoop is packaged in an OS-specific format (e.g. RPM), those installations need to be tested.

IaaS Testing

Hadoop can be used to stress test Infrastructure as a Service platforms, and is itself offered as a service by some providers (Cloudera, Amazon EC2).

Hadoop can be used on Eucalyptus installations using EC2 client libraries. This can show up problems with Eucalyptus (different fault messages compared to EC2, time zone/clock differences).

Other infrastructures will have different APIs, with different features (private subnets, machine restart and persistence).

See: A Cloud Tools Manifesto

Qualifying Hadoop on different platforms

Currently Hadoop is only used at scale on RHEL + the Sun JVM, because that is what Yahoo! run their clusters on; nobody else is running different platforms in their production clusters, or if they are, they aren't discussing it in public.

What would it take to test Hadoop releases on different operating systems? We'd need clusters of real or virtual machines, run the cluster qualification tests on them, and publish the results. This would not be a performance game; throughput isn't important. It's more a question of "does this work on a specific OS at 100+ machines?"

Exploring the Hadoop Configuration Space

There are a lot of Hadoop configuration options, even ignoring those of the underlying machines and network. For example, what impact do block size and replication factor have on your workload? Which network card configuration parameters give the best performance? Which combinations of options break things?
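As a small illustration of a single point in that space, here are a few of the better-known knobs set programmatically; the key names are the 0.20-era ones and the values are examples, not recommendations.

    // One point in the configuration space; keys are 0.20-era names, values are examples only.
    import org.apache.hadoop.conf.Configuration;

    public class ConfigPoint {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks instead of the 64 MB default
        conf.setInt("dfs.replication", 2);                  // 2 replicas instead of the default 3
        conf.setInt("io.sort.mb", 200);                     // larger map-side sort buffer
        System.out.println("block size = " + conf.getLong("dfs.block.size", 0)
            + ", replication = " + conf.getInt("dfs.replication", 3));
      }
    }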

When combined with IaaS platforms, the configuration space gets even larger.

Manually exploring the configuration space takes too long; currently everyone tries to stick close to the Yahoo! configurations, which are believed to work, and whenever someone strays off them, interesting things happen. For example, setting a replication factor of only 2 found a duplication bug; running Hadoop on a machine that isn't quite sure of its hostname shows up other assumptions that you cannot rely on.

Proposal: Make this a research topic, pull in the experts in testing, and give encouragement to work on this problem. Offering cluster time may help.

Testing applications that run on Hadoop

This was the goal of Alex's Circus prototype: something to make it easier for you to be confident that your code will work.
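That prototype isn't described here; as one hedged example of the kind of quick confidence check people want, a job driver can be pointed at the local job runner and local filesystem, so the application's plumbing is exercised in a single process before it goes near a cluster. The class and paths below are purely illustrative.

    // Sketch of a local-mode smoke test using the old mapred API; class name and paths are illustrative.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LocalSmokeTest {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalSmokeTest.class);
        conf.set("mapred.job.tracker", "local"); // run map/reduce in-process
        conf.set("fs.default.name", "file:///"); // use the local filesystem, not HDFS
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        // Default identity mapper/reducer: records are copied straight through,
        // which is enough to check that input, output and job submission all work.
        FileInputFormat.setInputPaths(conf, new Path("smoke-input"));
        FileOutputFormat.setOutputPath(conf, new Path("smoke-output"));
        JobClient.runJob(conf);
      }
    }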

Testing changes to Hadoop, fast

Hadoop unit/functional testing is slow, with MiniMR/MiniDFS cluster setup and teardown per test. This could be addressed by reusing the mini clusters more (see the sketch below), but it could be even faster if people could push out newly compiled JARs and test them at scale.
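A hedged sketch of the reuse idea, assuming the MiniDFSCluster class from Hadoop's test jar and JUnit 4: bring the mini cluster up once per test class rather than once per test method.

    // Sketch: share one MiniDFSCluster across a test class instead of per-test setup/teardown.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.assertTrue;

    public class SharedMiniClusterTest {
      private static MiniDFSCluster cluster;
      private static FileSystem fs;

      @BeforeClass
      public static void startCluster() throws Exception {
        Configuration conf = new Configuration();
        cluster = new MiniDFSCluster(conf, 2, true, null); // 2 datanodes, format on startup
        fs = cluster.getFileSystem();
      }

      @AfterClass
      public static void stopCluster() throws Exception {
        if (cluster != null) cluster.shutdown();
      }

      @Test
      public void testCreateFile() throws Exception {
        Path p = new Path("/reuse/file1");
        fs.create(p).close();
        assertTrue(fs.exists(p));
      }
    }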

Testing Hadoop Distributions

This is a problem for Cloudera and others who distribute, or internally package and deploy, Hadoop: you need to know that your RPMs or other redistributables work.

It's similar to the cluster acceptance test problem, except that you need to create the distribution packages and install them on the remote machines before running the tests. The testing-over-IaaS use cases are the closest match.

Simulating Cluster Failures

Cluster failure handling, especially the loss of large portions of a large datacenter, is something that is not currently formally tested. Fixes go into Hadoop to handle some of this, but the loss of a quarter of the datanodes is a disaster that doesn't get tested at scale before a release is made.

Proposal: Any infrastructure API should offer the opportunity to simulate failures, either by turning off nodes without warning or (better) by breaking the network connections between live nodes.
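At small scale, sudden node loss can already be approximated in-process; a hedged sketch, assuming MiniDFSCluster and its stopDataNode() helper from Hadoop's test jar, which kills datanodes without warning so that namenode behaviour can be observed.

    // Sketch: simulate sudden loss of datanodes with MiniDFSCluster (test jar); small scale only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class DataNodeLossSimulation {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 8, true, null); // 8 datanodes
        try {
          FileSystem fs = cluster.getFileSystem();
          fs.create(new Path("/failure-test/data")).close();
          // Kill a quarter of the datanodes without warning; a real test would then
          // assert that the namenode notices and re-replicates under-replicated blocks.
          cluster.stopDataNode(0);
          cluster.stopDataNode(1);
          System.out.println("Datanodes remaining in the mini cluster: "
              + cluster.getDataNodes().size());
        } finally {
          cluster.shutdown();
        }
      }
    }

Breaking network links between live nodes, as the proposal prefers, needs support outside the JVM (firewall rules or the infrastructure API itself); this sketch only covers the "turn off nodes without warning" case.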