This effort is still a "work in progress". Please feel free to add comments, but please make them stand out by bolding or underlining them. Thanks!
Table of Contents
Introduction
This document gives a quick overview of HBase, the Hadoop simple database. It is extremely similar to Google's Bigtable, with a just a few differences. If you understand Bigtable, great. If not, you should still be able to understand this document.
Data Model
HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
A column name has the form "<family>:<label>" where <family> and <label> can be any string you like. A single table enforces its set of <family>s (called "column families"). You can only adjust this set of families by performing administrative operations on the table. However, you can use new <label> strings at any write without preannouncing it. HBase stores column families physically close on disk. So the items in a given column family should have roughly the same write/read behavior.
Writes are row-locked only. You cannot lock multiple rows at once. All row-writes are atomic by default.
All updates to the database have an associated timestamp. The HBase will store a configurable number of versions of a given cell. Clients can get data by asking for the "most recent value as of a certain time". Or, clients can fetch all available versions at once.
Conceptual View
Conceptually a table may be thought of a collection of rows that are located by a row key (and optional timestamp) and where any column may not have a value for a particular row key (sparse). The following example is a slightly modified form of the one on page 2 of the
Bigtable Paper.
|
Row Key |
Time Stamp |
Column "contents:" |
Column "anchor:" |
Column "mime:" |
|
|
"com.cnn.www" |
t9 |
|
"anchor:cnnsi.com" |
"CNN" |
|
|
t8 |
|
"anchor:my.look.ca" |
"CNN.com" |
|
|
|
t6 |
"<html>..." |
|
|
"text/html" |
|
|
t5 |
"<html>..." |
|
|
|
|
|
t3 |
"<html>..." |
|
|
|
|
Physical Storage View
Although, at a conceptual level, tables may be viewed as a sparse set of rows, physically they are stored on a per-column basis. This is an important consideration for schema and application designers to keep in mind.
Pictorially, the table shown in the conceptual view above would be stored as follows:
|
Row Key |
Time Stamp |
Column "contents:" |
|
"com.cnn.www" |
t6 |
"<html>..." |
|
t5 |
"<html>..." |
|
|
t3 |
"<html>..." |
|
Row Key |
Time Stamp |
Column "anchor:" |
|
|
"com.cnn.www" |
t9 |
"anchor:cnnsi.com" |
"CNN" |
|
t8 |
"anchor:my.look.ca" |
"CNN.com" |
|
|
Row Key |
Time Stamp |
Column "mime:" |
|
"com.cnn.www" |
t6 |
"text/html" |
It is important to note in the diagram above that the empty cells shown in the conceptual view are not stored. Thus a request for the value of the "contents:" column at time stamp t8 would return a null value. Similarly, a request for an "anchor:" value at time stamp t9 for "my.look.ca" would return a null value.
However, if no timestamp is supplied, the most recent value for a particular column would be returned and would also be the first one found since time stamps are stored in descending order. Consequently the value returned for "contents:" if no time stamp is supplied is the value for t6 and the value for an "anchor:" for "my.look.ca" if no time stamp is supplied is the value for time stamp t8.
Example
To show how data is stored on disk, consider the following example:
A program writes rows "row[0-9]", column "anchor:foo"; then writes rows "row[0-9]"; column "anchor:bar"; and finally writes rows "row[0-9]" column "anchor:foo". After flushing the memcache and compacting the store, the contents of the MapFile would look like:
row=row0, column=anchor:bar, timestamp=1174184619081 row=row0, column=anchor:foo, timestamp=1174184620720 row=row0, column=anchor:foo, timestamp=1174184617161 row=row1, column=anchor:bar, timestamp=1174184619081 row=row1, column=anchor:foo, timestamp=1174184620721 row=row1, column=anchor:foo, timestamp=1174184617167 row=row2, column=anchor:bar, timestamp=1174184619081 row=row2, column=anchor:foo, timestamp=1174184620724 row=row2, column=anchor:foo, timestamp=1174184617167 row=row3, column=anchor:bar, timestamp=1174184619081 row=row3, column=anchor:foo, timestamp=1174184620724 row=row3, column=anchor:foo, timestamp=1174184617168 row=row4, column=anchor:bar, timestamp=1174184619081 row=row4, column=anchor:foo, timestamp=1174184620724 row=row4, column=anchor:foo, timestamp=1174184617168 row=row5, column=anchor:bar, timestamp=1174184619082 row=row5, column=anchor:foo, timestamp=1174184620725 row=row5, column=anchor:foo, timestamp=1174184617168 row=row6, column=anchor:bar, timestamp=1174184619082 row=row6, column=anchor:foo, timestamp=1174184620725 row=row6, column=anchor:foo, timestamp=1174184617168 row=row7, column=anchor:bar, timestamp=1174184619082 row=row7, column=anchor:foo, timestamp=1174184620725 row=row7, column=anchor:foo, timestamp=1174184617168 row=row8, column=anchor:bar, timestamp=1174184619082 row=row8, column=anchor:foo, timestamp=1174184620725 row=row8, column=anchor:foo, timestamp=1174184617169 row=row9, column=anchor:bar, timestamp=1174184619083 row=row9, column=anchor:foo, timestamp=1174184620725 row=row9, column=anchor:foo, timestamp=1174184617169
Note that column "anchor:foo" is stored twice (because the timestamp differs) and that the most recent timestamp is the first of the two entries (so the most recent update is always found first).
Client API
See the Javadoc for
HTable and
HBaseAdmin
Scanner API
To obtain a scanner, a Cursor-like row 'iterator' that must be closed,
instantiate an HTable, and then invoke obtainScanner. This method returns an
HScannerInterface against which you call
next and ultimately
close.
HRegion (Tablet) Server
To the user, a table seems like a list of data tuples, sorted by row key. Physically, tables are broken into HRegions. An HRegion is identified by its tablename plus a start/end-key pair. A given HRegion with keys <start> and <end> will store all the rows from [<start>, <end>]. A set of HRegions, sorted appropriately, forms an entire table.
All data is physically stored using Hadoop's DFS. Data is served to clients by a set of HRegionServers, usually one per machine. A given HRegion is served by only one HRegionServer at a time.
When a client wants to make updates, it contacts the relevant HRegionServer and commits the update to an HRegion. Upon commit, the data is added to the HRegion's HMemcache and to the HRegionServer's HLog. The HMemcache is a memory buffer that stores and serves the most-recent updates. The HLog is an on-disk log file that tracks all updates. The commit() call will not return to the client until the update has been written to the HLog.
When serving data, the HRegion will first check its HMemcache. If not available, it will then check its on-disk HStores. There is an HStore for each column family in an HRegion. An HStore might consist of multiple on-disk HStoreFiles. Each HStoreFile is a B-Tree-like structure that allow for relatively fast access.
Periodically, we invoke HRegion.flushcache() to write the contents of the HMemcache to an on-disk HStore's files. This adds a new HStoreFile to each HStore. The HMemcache is then emptied, and we write a special token to the HLog, indicating the HMemcache has been flushed.
On startup, each HRegion checks to see if there have been any writes to the HLog since the most-recent invocation of flushcache(). If not, then all relevant HRegion data is reflected in the on-disk HStores. If yes, the HRegion reconstructs the updates from the HLog, writes them to the HMemcache, and then calls flushcache(). Finally, it deletes the HLog and is now available for serving data.
Thus, calling flushcache() infrequently will be less work, but HMemcache will consume more memory and the HLog will take a longer time to reconstruct upon restart. If flushcache() is called frequently, the HMemcache will take less memory, and the HLog will be faster to reconstruct, but each flushcache() call imposes some overhead.
The HLog is periodically rolled, so it consists of multiple time-sorted files. Whenever we roll the HLog, the HLog will delete all old log files that contain only flushed data. Rolling the HLog takes very little time and is generally a good idea to do from time to time.
Each call to flushcache() will add an additional HStoreFile to each HStore. Fetching a value from an HStore can potentially access all of its HStoreFiles. This is time-consuming, so we want to periodically compact these HStoreFiles into a single larger one. This is done by calling HStore.compact().
Compaction is an expensive operation that runs in background. Its triggered when the number of HStoreFiles cross a configurable threshold.
The Google Bigtable paper has a slightly-confusing hierarchy of major and minor compactions. We have just two things to keep in mind:
A "flushcache()" drives all updates out of the memory buffer into on-disk structures. Upon flushcache, the log-reconstruction time goes to zero. Each flushcache() will add a new HStoreFile to each HStore.
a "compact()" consolidates all the HStoreFiles into a single one.
Unlike Bigtable, Hadoop's HBase allows no period where updates have been "committed" but have not been written to the log. This is not hard to add, if it's really wanted.
We can merge two HRegions into a single new HRegion by calling HRegion.closeAndMerge(). Currently both regions have to be offline for this to work.
When a region grows larger than a configurable size, HRegion.closeAndSplit() is called on the region server. Two new regions are created by dividing the parent region. The new regions are reported to the master for it to rule which region server should host each of the daughter splits. The division is pretty fast mostly because the daughter regions hold references to the parent's HStoreFiles -- one to the top half of the parent's HStoreFiles, and the other to the bottom half. While the references are in place, the parent region is marked offline and hangs around until compactions in the daughters cleans up all parent references at which time the parent is removed.
OK, to sum up so far:
Clients access data in tables.
tables are broken into HRegions.
HRegions are served by HRegionServers. Clients contact an HRegionServer to access the data within its row-range.
HRegions store data in:
HMemcache, a memory buffer for recent writes
HLog, a write-log for recent writes
HStores, an efficient on-disk set of files. One per col-group.
HBase Master Server
Each HRegionServer stays in contact with the single HMaster. The HMaster is responsible for telling each HRegionServer what HRegions it should load and make available.
The HMaster keeps a constant tally of which HRegionServers are alive at any time. If the connection between an HRegionServer and the HMaster times out, then:
The HRegionServer kills itself and restarts in an empty state.
The HMaster assumes the HRegionServer has died and reallocates its HRegions to other HRegionServers
Note that this is unlike Google's Bigtable, where a TabletServer can still serve Tablets after its connection to the Master has died. We tie them together, because we do not use an external lock-management system like Bigtable. With Bigtable, there's a Master that allocates tablets and a lock manager (Chubby) that guarantees atomic access by TabletServers to tablets. HBase uses just a single central point for all HRegionServers to access: the HMaster.
(This is no more dangerous than what Bigtable does. Each system is reliant on a network structure (whether HMaster or Chubby) that must survive for the data system to survive. There may be some Chubby-specific advantages, but that's outside HBase's goals right now.)
As HRegionServers check in with a new HMaster, the HMaster asks each HRegionServer to load in zero or more HRegions. When the HRegionServer dies, the HMaster marks those HRegions as unallocated, and attempts to give them to different HRegionServers.
Recall that each HRegion is identified by its table name and its key-range. Since key ranges are contiguous, and they always start and end with NULL, it's enough to simply indicate the start-key.
Unfortunately, this is not quite enough. Because of merge() and split(), we may (for just a moment) have two quite different HRegions with the same name. If the system dies at an inopportune moment, both HRegions may exist on disk simultaneously. The arbiter of which HRegion is "correct" is the HBase meta-information (to be discussed shortly). In order to distinguish between different versions of the same HRegion, we also add a unique 'regionId' to the HRegion name.
Thus, we finally get to this identifier for an HRegion: tablename + startkey + regionId Here's an example where the table is name hbaserepository, the start key is w-nk5YNZ8TBb2uWFIRJo7V== and the region id is 6890601455914043877: hbaserepository,w-nk5YNZ8TBb2uWFIRJo7V==,6890601455914043877
META Table
We can also use this identifier as a row-label in a different HRegion. Thus, the HRegion meta-info is itself stored in an HRegion. We call this table, which maps from HRegion identifiers to physical HRegionServer locations, the META table.
The META table itself can grow large, and may be broken into separate HRegions. To locate all components of the META table, we list all META HRegions in a ROOT table. The ROOT table is always contained in a single HRegion.
Upon startup, the HMaster immediately attempts to scan the ROOT table (because there is only one HRegion for the ROOT table, that HRegion's name is hard-coded). It may have to wait for the ROOT table to be allocated to an HRegionServer.
Once the ROOT table is available, the HMaster scans it and learns of all the META HRegions. It then scans the META table. Again, the HMaster may have to wait for all the META HRegions to be allocated to different HRegionServers.
Finally, when the HMaster has scanned the META table, it knows the entire set of HRegions. It can then allocate these HRegions to the set of HRegionServers.
The HMaster keeps the set of currently-available HRegionServers in memory. Since the death of the HMaster means the death of the entire system, there's no reason to store this information on disk.
Unlike Bigtable, which stores information about Tablet->TabletServer mapping in Chubby, Google's distributed lock server, all information about the HRegion->HRegionServer mapping is stored physically in the META table (since there is no equivalent to Chubby in the Hadoop environment).
Consequently each row in the META and ROOT tables has three members of the "info:" column family:
info:regioninfo contains a serialized
HRegionInfo object info:server contains a serialized string which is the output from
HServerAddress.toString(). This string can be supplied to one of the
HServerAddress constructors. info:startcode a serialized long integer generated by the HRegionServer when it starts. The HRegionServer sends this start code to the master so the master can determine if the server information in the META and ROOT regions is stale.
Thus, a client does not need to contact the HMaster after it learns the location of the ROOT HRegion. The load on HMaster should be relatively small: it deals with timing out HRegionServers, scanning the ROOT and META upon startup, and serving the location of the ROOT HRegion.
The HBase client is fairly complicated, and often needs to navigate the ROOT and META HRegions when serving a user's request to scan a specific user table. If an HRegionServer is unavailable or it does not have an HRegion it should have, the client will wait and retry. At startup or in case of a recent HRegionServer failure, the correct mapping info from HRegion to HRegionServer may not always be available.
Summary
HRegionServers offer access to HRegions (an HRegion lives at one HRegionServer)
HRegionServers check in with the HMaster
If the HMaster dies, the whole system dies
The set of current HRegionServers is known only to the HMaster
The mapping between HRegions and HRegionServers is stored in two special HRegions, which are allocated to HRegionServers like any other.
The ROOT HRegion is a special one, the location of which the HMaster always knows.
It's the client's responsibility to navigate all this.
Current Status
As of this writing (2007/08/16), there are approximately 27,000 lines of code in "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.
There are also about 7200 lines of test cases.
All of the single-machine operations (safe-committing, merging, splitting, versioning, flushing, compacting, log-recovery) are complete, have been tested, and seem to work great.
The multi-machine stuff (the HMaster and the HRegionServer) are actively being enhanced and debugged.
Issues and TODOs:
How do we know if a region server is really dead, or if the network is partitioned or if the region server is merely late in reporting in or getting its lease renewed? If we decide that a region server is dead, and it is not, it could still be doing updates on behalf of clients, adding to its log. It is not until it does successfully report in that it knows the master has "delisted" it. Only at that point does it start flushing the cache, finishing the log, etc. In the mean time the master may be ripping the rug out from under it by trying to split its log file (the most recent of which will be zero length because it is visible, but has no content until the region server closes it), and may have already reassigned the regions being served by the region server to another one, which will at a minimum lose data, and in the worst case, corrupt the region. This issue is being addressed in
HADOOP-1937 Vuk Ercegovac <vercego AT SPAMFREE us DOT ibm DOT com> of IBM Almaden Research pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against abnormal RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if RegionServer exits without running its 'close'.
The HMemcache lookup structure is relatively inefficient
Implement some kind of block caching in HRegion. While the DFS isn't hitting the disk to fetch blocks, HRegion is making IPC calls to DFS (via MapFile)
Investigate possible performance problem or memory management issue related to random reads. As more and more random reads are done, performance slows down and the memory footprint increases
Profile. Bulk of time seems to be spent RPC'ing. Improve RPC or amend how hbase uses RPC.
See
hbase issues for list of whats being currently worked on.
Comments
Please add comments about the architecture below. In the future, as this page grows too big, it will be split into multiple sub-pages based on the architectural component. Applicable comments will then be moved to that page. At that point, comments on this page should be related to an overall architectural issue or one that spans multiple components. Thank you.
It is not Row-Oriented.
by Udanax <webmaster AT SPAMFREE udanax DOT org> 2007/02/06
I think Hbase should be compact (space-efficient), fast and should be able to manage high-demand load. It should be able to handle sparse tables efficiently. So, for wide and sparse data, Hbase must store data by columns like C-Store does.
I agree. See the sections on the conceptual data model and the physical data model. -- JimKellerman 2007/05/30
A column-oriented system handles NULLs more easily with significantly smaller performance overhead, and supports both Horizontal and Vertical Parallel Processing.
Bigtable (and Hbase) do not store nulls. If there is no value for a particular key, then an empty or null value will be returned -- JimKellerman 2007/05/30
Let's consider the following case: You may be familiar to RDF(Resource Description Framework) Storage from W3C, which is
Storing and managing very large amounts of structured data
Row/column space can be sparse
Columns are in the form of (family: optional qualifier). This is a RDF Properties
Columns have type information
In both Bigtable, and Hbase, there is no notion of type. Keys and values in Bigtable are arbitrary strings. In Hbase, values are an arbitrary byte array. -- JimKellerman 2007/05/30
Because of the design of the system, columns are easy to create (and are created implicitly)
In Bigtable, column families are easy to create but they require administration priviliges (Access Control Lists control who can manipulate the schema. New column family members can be created at any time. Hbase follows this metaphor. -- JimKellerman 2007/05/30
Column families can be split into locality groups (Ontologies!)
Let's assume a large amount of RDF documents are stored in the system. And then, vertical(column) data set by one of RDF properties can be read fast from Table, because it is column-stored. Please let me know if you don't agree with me.