Testing HBase Performance and Scalability
Content
June 2007, First Evaluation of Region Server -- June 8th, 2007
September 2007, Second Evaluation of Region Server -- September 16th, 2007
0.1.2 HBase Performance Evaluation Run -- April 25th, 2008
0.2.0 HBase Performance Evaluation Run -- August 8th, 2008
0.19.0RC1 HBase Performance Evaluation Run -- January 16th, 2009
Tool Description
HADOOP-1476 adds to HBase src/test the script org.apache.hadoop.hbase.PerformanceEvaluation (June 12th, 2007). It runs the tests described in Performance Evaluation, Section 7 of the BigTable paper. See the citation for test descriptions; they are not repeated below. The script is useful for evaluating HBase performance and how well it scales as we add region servers.
Here is the current usage for the PerformanceEvaluation script:
[stack@aa0-000-12 ~]$ ./hadoop-trunk/src/contrib/hbase/bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
Usage: java org.apache.hadoop.hbase.PerformanceEvaluation [--master=host:port] [--miniCluster] <command> <nclients>
Options:
master Specify host and port of HBase cluster master. If not present,
address is read from configuration
miniCluster Run the test on an HBaseMiniCluster
Command:
randomRead Run random read test
randomReadMem Run random read test where table is in memory
randomWrite Run random write test
sequentialRead Run sequential read test
sequentialWrite Run sequential write test
scan Run scan test
Args:
nclients Integer. Required. Total number of clients (and HRegionServers)
running: 1 <= value <= 500
Examples:
To run a single evaluation client:
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1
If you pass nclients > 1, PerformanceEvaluation starts up a mapreduce job in which each map runs a single loading client instance.
To run the PerformanceEvaluation script, compile the HBase test classes:
$ cd ${HBASE_HOME}
$ ant compile-test
The above ant target compiles all test classes into ${HADOOP_HOME}/build/contrib/hbase/test. It also generates ${HADOOP_HOME}/build/contrib/hbase/hadoop-hbase-test.jar. The latter jar includes all HBase test and src classes and has org.apache.hadoop.hbase.PerformanceEvaluation as its Main-Class. Use the test jar when running PerformanceEvaluation on a hadoop cluster (you would run the client as a MR job when you want to run multiple clients concurrently).
Here is how to run a single-client PerformanceEvaluation sequentialWrite test:
{{{$ ${HADOOP_HOME}/src/contrib/hbase/bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1 }}}
Here is how you would run the same on hadoop cluster:
{{{$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/build/contrib/hbase/hadoop-hbase-test.jar sequentialWrite 1 }}}
For the latter, you will likely have to copy your hbase configurations -- e.g. your ${HBASE_HOME}/conf/hbase*.xml files -- to ${HADOOP_HOME}/conf and make sure they are replicated across the cluster so your hbase configurations can be found by the running mapreduce job (in particular, clients need to know the address of the HBase master).
Note that the mapreduce mode of the testing script works a little differently from single client mode. It does not delete the test table at the end of each run, as is done when the script runs in single client mode. Nor does it pre-run the sequentialWrite test before it runs the sequentialRead test (the table needs to be populated with data before sequentialRead can run). For the mapreduce version, the onus is on the operator to run the jobs in the correct order. To delete a table, use the hbase shell and run the drop table command (run 'help;' after starting the shell to see how).
{{{$ ${HBASE_HOME}/bin/hbase shell }}}
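One way to keep the mapreduce runs in a safe order is to generate the command lines up front. The following is a minimal Python sketch and not part of HBase: the function name `plan_mapreduce_runs` and the `WRITE_BEFORE` table are hypothetical, and the command line simply mirrors the `hadoop jar` invocation shown above. It only builds the commands; running them and dropping the table afterwards is still up to the operator.

```python
import os

# Tests that need a populated table, mapped to the write test that populates it.
# Only the sequentialRead -> sequentialWrite dependency is stated in the text;
# treat any other entry you add as your own assumption.
WRITE_BEFORE = {"sequentialRead": "sequentialWrite"}


def plan_mapreduce_runs(tests, nclients, hadoop_home=None):
    """Return command lines (lists of arguments) in an order that is safe to run.

    A prerequisite write test is inserted ahead of any test that depends on it,
    unless that write test already appears earlier in the plan.
    """
    hadoop_home = hadoop_home or os.environ.get("HADOOP_HOME", "${HADOOP_HOME}")
    hadoop = hadoop_home + "/bin/hadoop"
    test_jar = hadoop_home + "/build/contrib/hbase/hadoop-hbase-test.jar"

    ordered = []
    for test in tests:
        prerequisite = WRITE_BEFORE.get(test)
        if prerequisite and prerequisite not in ordered:
            ordered.append(prerequisite)
        ordered.append(test)

    return [[hadoop, "jar", test_jar, test, str(nclients)] for test in ordered]


if __name__ == "__main__":
    # Print the planned commands rather than executing them.
    for command in plan_mapreduce_runs(["sequentialRead", "scan"], nclients=8):
        print(" ".join(command))
```

Printing the plan before running anything makes it easy to check that a populating write precedes the sequential read.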
One Region Server on June 8th, 2007
Here are some first figures for HBase in advance of any profiling and before addition of caching, etc., taken June 8, 2007.
This first test ran on a mini test cluster of only four machines, not the 1768 of the BigTable paper. Each node had 8G of RAM and 2x dual-core 2GHz Opterons. Every member ran an HDFS datanode. One node ran the namenode and the HBase master, another the region server, and a third an instance of the PerformanceEvaluation script configured to run one client. Clients write ~1GB of data: one million rows, where each row has a single column whose value is 1000 randomly-generated bytes (see the BigTable paper for a better description).
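As a quick check of the "~1GB" figure, the arithmetic can be written out as below. This is only an illustrative Python sketch; it counts the value payload alone and ignores row keys and any storage overhead, which the text does not quantify.

```python
ROWS = 1_000_000      # rows written per client, as described above
VALUE_BYTES = 1000    # one column per row holding 1000 random bytes

# Payload only: row keys and on-disk overhead are not included.
total_bytes = ROWS * VALUE_BYTES

print(f"{total_bytes / 10**9:.2f} GB (decimal units)")
print(f"{total_bytes / 2**30:.2f} GiB (binary units)")
```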
Experiment | HBase | BigTable |
random reads | 68 | 1212 |
random reads (mem) | Not implemented | 10811 |
random writes | 847 | 8850 |
sequential reads | 301 | 4425 |
sequential writes | 850 | 8547 |
scans | 3063 | 15385 |
The above table lists how many 1000-byte rows were read or written per second. The BigTable values are from the '1' Tablet Server column of Figure 6 of the BigTable paper.
Except for scanning, we seem to be an order of magnitude off at the moment. Watching the region server during the write tests, it was lightly loaded. At a minimum, there would appear to be issues with liveness/synchronization in need of fixing.
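The "order of magnitude" remark can be checked directly against the table above. The snippet below is a small Python sketch; the figures are copied from the table, and the helper name `bigtable_ratio` is made up for illustration.

```python
# Rows per second from the table above: (HBase, BigTable).
RESULTS = {
    "random reads": (68, 1212),
    "random writes": (847, 8850),
    "sequential reads": (301, 4425),
    "sequential writes": (850, 8547),
    "scans": (3063, 15385),
}


def bigtable_ratio(hbase_rows_per_sec, bigtable_rows_per_sec):
    """How many times larger the BigTable figure is than the HBase one."""
    return bigtable_rows_per_sec / hbase_rows_per_sec


for experiment, (hbase, bigtable) in RESULTS.items():
    print(f"{experiment:18s} {bigtable_ratio(hbase, bigtable):5.1f}x")
```

Every test except scans comes out at a ratio of 10x or more, while scans are about 5x off.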
More to follow after more analysis.
One Region Server on September 16th, 2007
Ran the same setup as for the first test above, on the same machines. The main performance improvement in hbase is that batch updates are now sent to the server by the client only on commit, where before each batch operation -- start, put, commit -- required a trip to the server. This change cuts the number of trips to the server by at least two thirds. In addition, client/server communication now passes bytes rather than an object wrapping bytes where that makes sense, for some savings on RPC.
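To see why the saving is "at least" two thirds, the trip counts can be modelled as below. This Python sketch assumes a batch with n puts used to cost one trip for start, one per put, and one for commit; that reading of the old protocol is an assumption drawn from the description above, and the function names are illustrative only.

```python
def trips_before(puts_per_batch):
    """Old client (assumed): start, every put, and commit each cost one trip."""
    return 1 + puts_per_batch + 1


def trips_after(puts_per_batch):
    """New client: the whole batch is sent once, on commit."""
    return 1


def fraction_saved(puts_per_batch):
    """Share of server trips removed by sending the batch only on commit."""
    return 1 - trips_after(puts_per_batch) / trips_before(puts_per_batch)


for puts in (1, 2, 5):
    print(f"{puts} put(s) per batch: {fraction_saved(puts):.0%} fewer trips")
```

With a single put per batch the saving is exactly two thirds, and it grows as more puts share one commit.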
Here is the loading command run: {{{$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 1 }}}
Experiment | HBase20070708 | HBase20070916 | BigTable |
random reads | 68 | 272 | 1212 |
random reads (mem) | Not implemented | Not implemented | 10811 |
random writes | 847 | 1460 | 8850 |
sequential reads | 301 | 267 | 4425 |
sequential writes | 850 | 1278 | 8547 |
scans | 3063 | 3692 | 15385 |
The above table lists how many 1000-byte rows were read or written per second.
Random reads are 4x faster, random writes ~70% faster, sequential writes ~50% faster, and scans ~20% faster, while sequential reads are slightly slower (267 vs. 301). Still a long way to go...
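The changes between the two runs can be recomputed from the HBase20070708 and HBase20070916 columns above. The following Python sketch does so; `percent_change` is a hypothetical helper name and the numbers are copied from the table.

```python
# Rows per second: (HBase20070708, HBase20070916), copied from the table above.
RUNS = {
    "random reads": (68, 272),
    "random writes": (847, 1460),
    "sequential reads": (301, 267),
    "sequential writes": (850, 1278),
    "scans": (3063, 3692),
}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means slower)."""
    return (new - old) / old * 100


for experiment, (old, new) in RUNS.items():
    print(f"{experiment:18s} {percent_change(old, new):+7.1f}%")
```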
HBase Release 0.15.0
Experiment | HBase20070708 | HBase20070916 | 0.15.0 | BigTable |
random reads | 68 | 272 | 264 | 1212 |
random reads (mem) | Not implemented | Not implemented | Not implemented | 10811 |
random writes | 847 | 1460 | 1277 | 8850 |
sequential reads | 301 | 267 | 305 | 4425 |
sequential writes | 850 | 1278 | 1112 | 8547 |
scans | 3063 | 3692 | 3758 | 15385 |
HBase TRUNK 12/19/2007, r605675
Below are numbers for version r605675, 12/19/2007. Looks like writing got a bit better -- probably because of recent lock refactoring -- but reading speed has almost halved. I'd have expected the read speeds to also have doubled but I'd guess there is still a HADOOP-2434 like issue over in the datanode.
I've also added numbers for sequential writes, random reads, and next ('scan') reads into and out of a single *open* HDFS mapfile for comparison: i.e. when random reading, we are not opening the file each time and the mapfile index is loaded into memory. Going by current numbers, pure mapfile writes are slower than the numbers Google posted in the initial BigTable paper and reads just a bit faster (except when scanning). GFS must be fast.
Experiment Run | HBase20070708 | HBase20070916 | 0.15.0 | 20071219 | mapfile | BigTable |
random reads | 68 | 272 | 264 | 167 | 685 | 1212 |
random reads (mem) | Not implemented | Not implemented | Not implemented | Not implemented | - | 10811 |
random writes | 847 | 1460 | 1277 | 1400 | - | 8850 |
sequential reads | 301 | 267 | 305 | 138 | - | 4425 |
sequential writes | 850 | 1278 | 1112 | 1691 | 5494 | 8547 |
scans | 3063 | 3692 | 3758 | 3731 | 25641 | 15385 |
Subsequently I profiled the mapfile PerformanceEvaluation. Turns out generation of the values and keys to insert was taking a bunch of CPU time. After making a fix, key and value generation was between 15-25% (the alternative was precompiling keys and values, which would take loads of memory). Rerunning tests, it looks like there can be a pretty broad range of fluctuation in mapfile numbers between runs. I also noticed that the 0.15.x random reads seem to be 50% faster than TRUNK. Investigate.
HBase 0.1.2 (Candidate) 04/25/2008
Numbers for the 0.1.2 candidate. The 'mapfile', '20071219', and 'BigTable' columns are copied from the 'TRUNK 12/19/2007' above. The new columns are for 0.1.2 and for mapfile in 0.16.3 hadoop (This latter test uses new MapFilePerformanceTest script).
Experiment Run | 20071219 | 0.1.2 | mapfile | mapfile0.16.3 | BigTable |
random reads | 167 | 351 | 685 | 644 | 1212 |
random reads (mem) | Not implemented | Not implemented | Not implemented | - | 10811 |
random writes | 1400 | 2330 | - | - | 8850 |
sequential reads | 138 | 349 | - | - | 4425 |
sequential writes | 1691 | 2479 | 5494 | 6204 | 8547 |
scans | 3731 | 6278 | 25641 | 47662 | 15385 |
We've nearly doubled in most areas over the hbase from 20071219. It's a combination of improvements in hadoop -- particularly scanning -- and in HBase itself (locking was redone, we customized RPC to use codes rather than class names, etc.).
HBase 0.2.0 08/08/2008
Numbers for hbase 0.2.0 on java6 on hadoop 0.17.1 and hbase 0.2.0 on hadoop 0.18.0 using java7. Also includes numbers for hadoop mapfile for hadoop 0.16.4, 0.17.1, and hadoop 0.18.0 (on java6) as well as rows copied from above tables so you can compare progress.
Experiment Run | 20071219 | 0.1.2 | 0.2.0java6 | 0.2.0java7 | mapfile | mapfile0.16.4 | mapfile0.17.1 | mapfile0.18.0 | BigTable |
random reads | 167 | 351 | 428 | 560 | 685 | 644 | 568 | 915 | 1212 |
random reads (mem) | Not implemented | Not implemented | Not implemented | - | - | - | - | - | 10811 |
random writes | 1400 | 2330 | 2167 | 2218 | - | - | - | - | 8850 |
sequential reads | 138 | 349 | 427 | 582 | - | - | - | - | 4425 |
sequential writes | 1691 | 2479 | 2076 | 1966 | 5494 | 6204 | 5684 | 5800 | 8547 |
scans | 3731 | 6278 | 3737 | 3784 | 25641 | 47662 | 55692 | 58054 | 15385 |
HBase 0.19.0RC1 01/16/2009
Numbers for hbase 0.19.0RC1 on hadoop 0.19.0 and java6.
{{{
[stack@aa0-000-13 ~]$ ~/bin/jdk/bin/java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)
}}}
Tried java7 but no discernible difference (build 1.7.0-ea-b43).
Also includes numbers for hadoop mapfile. Table includes last test, 0.2.0java6 (and hadoop 0.17.2) from above for easy comparison.
Start cluster fresh for each test then wait for all regions to be deployed before starting up tests (means no content in memcache which means that for such as random read we are always going to the filesystem, never getting values from memcache).
Experiment Run | 0.2.0java6 | mapfile0.17.1 | 0.19.0RC1Java6 | 0.19.0RC1Java6Zlib | 0.19.0RC1Java6,8Clients | mapfile0.19.0 | BigTable |
random reads | 428 | 568 | 540 | 80 | 768 | 768 | 1212 |
random reads (mem) | - | - | - | - | - | - | 10811 |
random writes | 2167 | 2218 | 9986 | - | - | - | 8850 |
sequential reads | 427 | 582 | 464 | - | - | - | 4425 |
sequential writes | 2076 | 5684 | 9892 | 7182 | 14027 | 7519 | 8547 |
scans | 3737 | 55692 | 20971 | 20560 | 14742 | 55555 | 15385 |
Some improvement writing and scanning (faster than BigTable paper seemingly). Random Reads still lag. Sequential Reads lag badly. A bit of fetch-ahead as we did scanning should help here.
The speedup is a combination of hdfs improvements, hbase improvements including batching when writing and scanning (the BigTable PE description alludes to scans using prefetch), and the use of two JBOD'd disks -- as in the Google paper -- where in the previous tests above all disks were RAID'd. Otherwise, the hardware is the same and similar to the BigTable paper's: dual dual-core Opterons, 1G for hbase, etc.
Of note, the mapfile numbers are lower than those of hbase when writing because the mapfile tests write one file, whereas hbase after the first split is writing to multiple files concurrently. On the other hand, hbase random read is very like mapfile random read, at least in the single client case; we're effectively asking the filesystem for a random value from the midst of a file in both cases. The mapfile numbers are useful as a gauge of how much hdfs has come on since the last time we ran PE.
Block compression (native zlib -- hbase bug won't let you specify anything but the DefaultCodec, e.g. lzo) is a little slower writing, way slower random-reading but about same scanning.
The 8 concurrent clients write to a single regionserver instance. Our cluster is four computers. Load was put up by running a MR job as follows:
$ ./bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation randomRead 8
The MR job ran two mappers per computer, so 8 clients were running concurrently. Timings were those reported at the head of the MR job page in the UI.
More Information
As of this writing (2011), YCSB is a popular tool for testing performance. See the HBase book for more information http://hbase.apache.org/book.html