THIS PAGE IS NOW OBSOLETE AND NO LONGER MAINTAINED. ROADMAP IS KEPT UP IN JIRA. SEE THE JIRA PER-VERSION ROADMAP PAGE: https://issues.apache.org/jira/browse/HBASE?report=com.atlassian.jira.plugin.system.project:roadmap-panel
0.90 January, 2011
- New Master
- Data Durability
- Much improved i/o profile
Old Road Maps (Obsolete)
New features are planned in approximately six-month windows and are listed in approximate priority order.
The first release to offer data durability, via HDFS-200 and HDFS-142. Targeted for release in late May/June.
March 2008 - September 2008
- Integrate ZooKeeper The Bigtable paper, Section 4, describes how Chubby, a distributed lock manager and repository of state, is used as the authority for the list of servers that make up a Bigtable cluster, the location of the root region, and as the repository for table schemas. Currently, the HBase master process runs all of the services the Bigtable paper ascribes to Chubby. Instead, we would move these services out of the single-process HBase master and run them in a ZooKeeper cluster. ZooKeeper is a Chubby near-clone that, like HBase, is a subproject of Hadoop. Integrating ZooKeeper will make cluster state robust against individual server failures and make for tidier state transitions.
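The split described above can be sketched as a toy model (not the ZooKeeper API itself): live-server entries are ephemeral and disappear with their owner's session, while the root region location and schemas are persistent and survive any single server failure. All names and paths here are illustrative.

```java
import java.util.*;

// Toy model of the ZooKeeper-style state the master's services would move to:
// ephemeral entries (live region servers) vanish when their owner's session
// expires, while persistent entries (root region location, schemas) survive.
class ClusterState {
    private final Map<String, String> persistent = new TreeMap<>();
    private final Map<String, Map<String, String>> ephemeralBySession = new TreeMap<>();

    void putPersistent(String path, String value) { persistent.put(path, value); }

    void putEphemeral(String session, String path, String value) {
        ephemeralBySession.computeIfAbsent(session, s -> new TreeMap<>()).put(path, value);
    }

    // A server failure expires its session; only its ephemeral state is lost.
    void expireSession(String session) { ephemeralBySession.remove(session); }

    Set<String> liveServers() {
        Set<String> servers = new TreeSet<>();
        for (Map<String, String> m : ephemeralBySession.values()) servers.addAll(m.keySet());
        return servers;
    }

    String get(String path) { return persistent.get(path); }
}
```

The point of the design: when a region server dies, the membership list corrects itself without the master having to be the single authoritative process.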
- Make the HBase Store file format pluggable
Currently HBase is hardcoded to use the Hadoop MapFile as its base file format (i.e., the equivalent of the SSTable from the Bigtable paper). Experience has shown the Hadoop MapFile to be suboptimal for HBase access patterns; for example, the MapFile index is ignorant of HBase 'rows'. We would change HBase to instead run against a file interface (HBASE-61 Create an HBase-specific MapFile implementation). A configuration option would dictate which file format implementation an HBase instance uses, just as you can swap 'engines' in MySQL.
Once the abstraction work is finished, we would add a new file format behind this interface to replace Hadoop MapFile, one more amenable to HBase I/O patterns. It would either be the TFile specified in the attachment to HADOOP-3315 New binary format or something similar.
Here are some notes on the new file format.
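The pluggable-format idea can be sketched as a small Java interface plus a configuration-driven factory. The names here (StoreFileFormat, SortedMapFormat) are illustrative only, not the actual HBASE-61 API; the in-memory implementation stands in for a real MapFile- or TFile-backed one.

```java
import java.util.TreeMap;

// Hypothetical interface a region server would code against; concrete formats
// (MapFile today, a TFile-like format later) would plug in behind it.
interface StoreFileFormat {
    void write(String key, byte[] value);
    byte[] read(String key);
}

// Toy in-memory stand-in for one pluggable implementation.
class SortedMapFormat implements StoreFileFormat {
    private final TreeMap<String, byte[]> data = new TreeMap<>();
    public void write(String key, byte[] value) { data.put(key, value); }
    public byte[] read(String key) { return data.get(key); }
}

class StoreFileFormats {
    // A configuration option picks the implementation, like swapping MySQL engines.
    static StoreFileFormat forName(String configured) {
        switch (configured) {
            case "sortedmap": return new SortedMapFormat();
            default: throw new IllegalArgumentException("unknown format: " + configured);
        }
    }
}
```

Adding a new format then means adding an implementation and a case to the factory, with no changes to calling code.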
- Redo HBase RPC Profiling revealed that RPC calls are responsible for a large portion of the latency involved in servicing requests. The HBase RPC is currently the RPC from Hadoop with minor modifications. Hadoop RPC was designed for the passing of occasional messages rather than for bulk data transfers (HDFS uses a different RPC mechanism for bulk data transfer). Among other unwanted attributes, at its core the HBase RPC has a bottleneck: it handles a single request at a time. We would replace our RPC with an asynchronous RPC better suited to the type of traffic carried by HBase communication.
- Batched Updates
We would add to the HBase client a means of batching updates. Currently, updates are sent one at a time. With batching enabled, the HBase client would buffer updates until it hit a threshold or a flush was explicitly invoked. The client would then sort the buffered edits by region and pass them in bulk, concurrently, out to the appropriate region servers. This feature would improve bulk-upload performance. Done -- Stack
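The buffering scheme can be sketched as follows. This is a toy model, not the actual HBase client API: "delivering" an edit here just counts it per region, standing in for the bulk call to each region server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy client-side write buffer: edits accumulate until a threshold is reached,
// then are grouped by region and flushed in bulk. Names are illustrative.
class BatchingClient {
    private final int threshold;
    private final List<String[]> buffer = new ArrayList<>(); // {region, row, value}
    // Edits delivered so far, per region (stands in for the bulk RPC).
    final Map<String, Integer> flushed = new TreeMap<>();

    BatchingClient(int threshold) { this.threshold = threshold; }

    void put(String region, String row, String value) {
        buffer.add(new String[]{region, row, value});
        if (buffer.size() >= threshold) flush();
    }

    // Group buffered edits by region and deliver each group in one bulk call.
    void flush() {
        for (String[] edit : buffer) {
            flushed.merge(edit[0], 1, Integer::sum);
        }
        buffer.clear();
    }
}
```

The win comes from amortization: one round trip per region per flush instead of one per edit.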
- In-memory tables Implement serving one or all column families of a table from memory.
- Locality Groups Section 6 of the Bigtable paper describes Locality Groups, a means of on-the-fly grouping of column families. The group can be treated as though it were a single column family. In implementation, all Locality Group members are saved to the same Store in the file system, rather than to a Store per column family as is done in HBase currently. A Locality Group's berth can be widened or narrowed by the administrator as usage patterns evolve, without need of a schema change and the attendant re-upload. At their maximum spread, all families would be part of a single Locality Group; in this configuration, HBase would act like a row-oriented store. At their narrowest, a Locality Group would map to a single column family.
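The two extremes above can be made concrete with a toy family-to-store mapping (names here are illustrative, not an HBase API): families assigned to the same group share a Store, and the administrator retunes the mapping without a schema change.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of Locality Groups: an administrator-tunable mapping from column
// family to the Store it is written to. One group holding every family gives
// row-oriented behavior; one group per family matches today's HBase layout.
class LocalityGroups {
    private final Map<String, String> familyToGroup = new TreeMap<>();

    void assign(String family, String group) { familyToGroup.put(family, group); }

    // Families in the same group share a Store in the file system.
    String storeFor(String family) { return familyToGroup.get(family); }
}
```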
- Data-Locality Awareness The Hadoop MapReduce framework makes a best effort at running tasks on the server hosting the task data, after the dictum that it is cheaper to move the processing to the data than the inverse. HBase needs smarts to assign regions to the region server running on the server hosting the regions' data, and it needs to supply MapReduce hints so that the Hadoop framework runs tasks beside the region server hosting the task input. These changes will make for savings in network I/O.
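The assignment smarts can be sketched as a one-rule placement policy, a simplification of what a real assigner would do: prefer a candidate region server whose host holds a replica of the region's data, falling back to any server otherwise.

```java
import java.util.List;
import java.util.Set;

// Toy locality-aware assignment: pick the region server running on a host
// that holds a replica of the region's data, so reads stay local.
class LocalityAssigner {
    static String pick(List<String> regionServers, Set<String> hostsWithData) {
        for (String rs : regionServers) {
            if (hostsWithData.contains(rs)) return rs; // local read, no network I/O
        }
        return regionServers.get(0); // no local option; fall back to any server
    }
}
```

The same replica information, exposed as hints, would let the MapReduce framework schedule tasks beside the serving region server.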
- Secondary indices The Bigtable primary key is the row key, kept in lexicographic sort order. Add a means of defining secondary indices on a table.
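A minimal sketch of what a secondary index adds over the primary key, using a toy in-memory structure rather than any proposed HBase design: the primary table is keyed by row, while the index maps a column value back to the sorted set of rows carrying it.

```java
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy secondary index: value -> sorted set of rows containing that value,
// enabling lookup by something other than the lexicographic row key.
class SecondaryIndex {
    private final TreeMap<String, TreeSet<String>> index = new TreeMap<>();

    void put(String row, String columnValue) {
        index.computeIfAbsent(columnValue, v -> new TreeSet<>()).add(row);
    }

    SortedSet<String> rowsWhere(String columnValue) {
        return index.getOrDefault(columnValue, new TreeSet<>());
    }
}
```

A real implementation would also have to keep the index consistent with deletes and updates, which is where most of the difficulty lies.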
- Access control Bigtable can control user access at the column family level. Leverage the Hadoop access control mechanism.
- Master fail over
An HBase cluster has a single master; if the master fails, the cluster shuts down. Develop robust master failover. Done -- Stack
Overall, much progress was made towards the goal of enhancing robustness and scalability: 293 issues were identified and fixed. However, a few key priorities were not addressed:
- "Too many open file handles" This mostly a Hadoop HDFS problem, although some of the pressure can be relieved by changing the HBase RPC.
- Taking advantage of Hadoop append support. Minimal append support did not appear in Hadoop until Hadoop 0.18. While what is there is enough for HBase, this work will probably be pushed to HBase 0.19, as we are quickly approaching the one-month lag between Hadoop and HBase releases.
- Checkpointing/Syncing was deferred until Hadoop support is available.