This page is a work in progress

Introduction

The Bigtable data model, and therefore the HBase data model too since HBase is a clone, is particularly well adapted to data-intensive systems. Getting high scalability from a relational database is not done by simply adding more machines, because its data model is based on a single-machine architecture. For example, a JOIN between two tables is done in memory and does not take into account the possibility that the data has to go over the wire. Companies that did offer distributed relational databases had a lot of redesign to do, and this is why they have high licensing costs. The other option is to use replication and, when the slaves are overloaded with writes, the last resort is to begin sharding the tables into sub-databases. At that point, data normalization is a thing you only remember seeing in class, which is why going with the data model presented in this paper shouldn't bother you at all.

Overview

To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>. The first Map maps row keys to their column families. The second maps column families to their column keys. The third one maps column keys to their timestamps. Finally, the last one maps each timestamp to a single value. The keys are typically strings, the timestamp is a long and the value is an uninterpreted array of bytes. The column key is always preceded by its family and is written like this: family:key. Since a family maps to another map, a single column family can contain a theoretically unlimited number of column keys. So, to retrieve a single value, the user has to do a get using three keys:

row key+column key+timestamp -> value
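The nested-map view above can be sketched in plain Java. This is only an illustration of the data model and not the HBase client API: the class name CellStore and its methods are hypothetical, and the keys are simplified to String instead of byte[].

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of the data model; not the HBase client API. */
public class CellStore {
    // row key -> column family -> column key -> timestamp -> value
    private final Map<String, Map<String, Map<String, Map<Long, byte[]>>>> rows = new TreeMap<>();

    /** Stores a value under row key + family:key + timestamp. */
    public void put(String row, String family, String key, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(key, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    /** Returns the value for the three keys, or null if any level is missing. */
    public byte[] get(String row, String family, String key, long timestamp) {
        Map<String, Map<String, Map<Long, byte[]>>> families = rows.get(row);
        if (families == null) return null;
        Map<String, Map<Long, byte[]>> columns = families.get(family);
        if (columns == null) return null;
        Map<Long, byte[]> versions = columns.get(key);
        return versions == null ? null : versions.get(timestamp);
    }

    public static void main(String[] args) {
        CellStore store = new CellStore();
        store.put("20080702", "info", "aaa", 1L, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] value = store.get("20080702", "info", "aaa", 1L);
        System.out.println(new String(value, StandardCharsets.UTF_8));
    }
}
```

Using sorted maps at every level is a choice made for this sketch; only the ordering of the row keys is described in this page.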

Rows

The row key is treated by HBase as an array of bytes but it must have a string representation. A special property of the row key Map is that it keeps the row keys in lexicographical order. For example, numbers going from 1 to 100 will be ordered like this: 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,90,91,92,93,94,95,96,97,98,99
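The ordering can be reproduced with a few lines of Java. This is a sketch that sorts plain String values, assuming ASCII digits so that String comparison matches byte-wise comparison; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LexicographicOrder {
    /** Returns the numbers from..to (inclusive) as strings, sorted lexicographically. */
    public static List<String> sortedAsStrings(int from, int to) {
        List<String> keys = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            keys.add(Integer.toString(i));
        }
        // String.compareTo compares character by character, not by numeric value
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        // prints [1, 10, 100, 11, 12]
        System.out.println(sortedAsStrings(1, 100).subList(0, 5));
    }
}
```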

To keep the integers' natural ordering, the row keys have to be left-padded with zeros. To take advantage of the sorted order, the row key Map is augmented with a scanner which takes a start row key (if not specified, the first one in the table) and a stop row key (if not specified, the last one in the table). For example, if the row keys are dates in the format YYYYMMDD, getting the month of July 2008 is a matter of opening a scanner from 20080700 to 20080800. It does not matter whether the specified row keys exist or not. The only thing to keep in mind is that the stop row key itself is never returned, which is why a key just past the last day of July is given to the scanner.
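The padding and the scanner bounds can be imitated with a sorted map. The sketch below uses java.util.TreeMap and its subMap method (start inclusive, stop exclusive) as a stand-in for the scanner; RowScan, pad and scan are hypothetical names and this is not the HBase scanner API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class RowScan {
    /** Left-pads a non-negative number with zeros so string order matches numeric order. */
    public static String pad(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    /** Returns the rows with startRow <= key < stopRow; neither bound has to exist. */
    public static SortedMap<String, String> scan(TreeMap<String, String> table,
                                                 String startRow, String stopRow) {
        return table.subMap(startRow, stopRow);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("20080630", "june");
        table.put("20080701", "first of july");
        table.put("20080731", "last of july");
        table.put("20080801", "august");
        // prints [20080701, 20080731]
        System.out.println(scan(table, "20080700", "20080800").keySet());
        // prints 007
        System.out.println(pad(7, 3));
    }
}
```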

Column Families

A column family groups data of the same nature in HBase and puts no constraint on its type. The families are part of the table schema and stay the same for each row; what differs from row to row is the set of column keys, which can be very sparse. For example, row "20080702" may have in its "info:" family the following column keys:

info:aaa

info:bbb

info:ccc

While row "20080703" only has:

info:12342

Developers have to be very careful when using column keys, since a key with a length of zero is permitted. This means that, in the previous example, data can be inserted in column key "info:". We strongly suggest using an empty column key only when no other key will be specified in that family. Also, since the data in a family has the same nature, many attributes regarding performance and timestamps can be specified per family.

Timestamps

The values in HBase may have multiple versions, kept according to the family configuration. By default, HBase sets the timestamp of each new value to the current time in milliseconds and returns the latest version when a cell is retrieved. Developers can also provide their own timestamps when inserting data, and can specify a certain timestamp when fetching it.
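The versioning behaviour of a single cell can be sketched as a small sorted map keyed by timestamp. VersionedCell is a hypothetical class written for this page, not part of HBase; dropping the oldest version once the limit is reached and the exact-match lookup by timestamp are simplifying assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of one cell that keeps a bounded number of versions. */
public class VersionedCell {
    private final int maxVersions;
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Inserts a value with a timestamp supplied by the developer. */
    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry(); // assumption: the oldest version is dropped
        }
    }

    /** Inserts a value stamped with the current time in milliseconds. */
    public void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    /** Returns the most recent version, or null if the cell is empty. */
    public String getLatest() {
        Map.Entry<Long, String> latest = versions.lastEntry();
        return latest == null ? null : latest.getValue();
    }

    /** Returns the version stored at exactly this timestamp, or null. */
    public String get(long timestamp) {
        return versions.get(timestamp);
    }
}
```

With a limit of 3, for instance, writing four values at timestamps 1 to 4 leaves only the versions at 2, 3 and 4, and getLatest returns the value written at 4.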

Family Attributes

The following attributes can be specified for each family:

Implemented

Still not implemented

Real Life Example

The following example is the same one given during the HBase ETS presentation, available in French on the presentation page.

A blog is a good example for demonstrating the HBase data model because of its simple features and domain. Suppose the following mini-SRS:

The Source ERD

Let us consider the ERD (entity relationship diagram) below:

http://people.apache.org/~jdcryans/db_blog.jpg

The HBase Target Schema

A first solution could be:

| Table | Row Key | Family | Attributes |
|---|---|---|---|
| blogtable | TTYYYYMMDDHHmmss | info: | Always contains the column keys author, title, under_title. Should be IN-MEMORY and have 1 version |
| | | text: | No column key. 3 versions |
| | | comment_title: | Column keys are written like YYYYMMDDHHmmss. Should be IN-MEMORY and have 1 version |
| | | comment_author: | Same keys. 1 version |
| | | comment_text: | Same keys. 1 version |
| usertable | login_name | info: | Always contains the column keys password and name. 1 version |

The row key for blogtable is a concatenation of its type (shortened to 2 letters) and its timestamp. This way, the rows will be gathered first by type and then by date throughout the cluster, which means more chances of hitting a single region to fetch the needed data. Also, you can see that the one-to-many relationship between BLOGENTRY and COMMENT is handled by putting each attribute of the comments as a family in blogtable. By using a date as the column key, all comments are already sorted.
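Building these keys can be sketched as follows. The class BlogRowKey, its method names and the example type code "TE" are hypothetical and only illustrate the TTYYYYMMDDHHmmss layout; the page does not define the actual type codes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for the key layout of blogtable. */
public class BlogRowKey {
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds a row key: 2-letter type followed by YYYYMMDDHHmmss. */
    public static String rowKey(String type, LocalDateTime when) {
        if (type.length() != 2) {
            throw new IllegalArgumentException("type must be exactly 2 letters");
        }
        return type + when.format(STAMP);
    }

    /** Builds a comment column such as comment_title:YYYYMMDDHHmmss. */
    public static String commentColumn(String family, LocalDateTime when) {
        return family + ":" + when.format(STAMP);
    }

    public static void main(String[] args) {
        LocalDateTime posted = LocalDateTime.of(2008, 7, 2, 9, 5, 0);
        // prints TE20080702090500 ("TE" is a made-up type code)
        System.out.println(rowKey("TE", posted));
        // prints comment_title:20080702090500
        System.out.println(commentColumn("comment_title", posted));
    }
}
```

Because the timestamp part has a fixed width, keys of the same type sort chronologically, so entries and comments come back in date order.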

One advantage of this design is that when you show the "front page" of your blog, you only have to fetch the family "info:" from blogtable. When you show an actual blog entry, you fetch a whole row. Another advantage is that by using timestamps in the row key, your scanner will fetch sequential rows if you want to show, for example, the entries from the last month.

Hbase/DataModel (last edited 2009-11-10 21:35:15 by tuxracer69)