Questions
Can someone give an example of basic API-usage going against hbase?
Why do I see "java.io.IOException...(Too many open files)" in my logs?
How do I access HBase from my Ruby/Python/Perl/PHP/etc. application?
How do I create a table with a column family named "count" (or some other HQL reserved word)?
Why is HBase ignoring HDFS client configuration such as dfs.replication?
Answers
See Bryan Duxbury's post on this topic.
2. Can someone give an example of basic API-usage going against hbase?
The two main client-side entry points are HBaseAdmin and HTable. Use HBaseAdmin to create, drop, list, enable and disable tables, and also to add and drop table column families. For adding, updating and deleting data, use HTable. Here is some pseudo code, absent error checking, imports, etc., that creates a table, adds data, fetches the just-added data and then deletes the table.
// First get a conf object. This will read in the configuration
// that is out in your hbase-*.xml files such as location of the
// hbase master node.
HBaseConfiguration conf = new HBaseConfiguration();
// Create a table named 'test' that has two column families,
// one named 'content', and the other 'anchor'. The colons
// are required for column family names.
HTableDescriptor desc = new HTableDescriptor("test");
desc.addFamily(new HColumnDescriptor("content:"));
desc.addFamily(new HColumnDescriptor("anchor:"));
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
HTableDescriptor[] tables = admin.listTables();
// New table should be in list of returned tables.
// Or you could call admin.tableExists("test");
HTable table = new HTable(conf, "test");
// Add content to 'content:' on a row named 'row_x'
String row = "row_x";
BatchUpdate update = new BatchUpdate(row);
update.put("content:", Bytes.toBytes("some content"));
table.commit(update);
// Now fetch the content just added
byte data[] = table.get(row, "content:");
// Delete the table.
admin.deleteTable(desc.getName());
For further examples, check out the hbase unit tests. These are probably your best source for sample code. Start with the code in org.apache.hadoop.hbase.TestHBaseCluster. It does a general table setup and then performs various client operations on the created table: loading, scanning, deleting, etc.
Don't forget your client will need a running hbase instance to connect to (see the Getting Started section toward the end of this Hbase Package Summary page).
See Hbase/Jython for the above example code done in Jython.
3. What other hbase-like applications are there out there?
Apart from Google's bigtable, here are ones we know of:
PNUTS, a Platform for Nimble Universal Table Storage, being developed internally at Yahoo!
Amazon SimpleDB is a web service for running queries on structured data in real time.
"Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks."
"Cassandra is a distributed storage system for managing structured data while providing reliability at a massive scale."
"Bigdata(R) is an open-source scale-out storage and computing fabric supporting optional transactions, very high concurrency, and very high aggregate IO rates."
"Neptune is Distributed Large scale Structured Data Storage, and open source project implementing Google's Bigtable."
4. Can I fix OutOfMemoryExceptions in hbase?
Out-of-the-box, hbase uses a default of 1G heap size. Set the HBASE_HEAPSIZE environment variable in ${HBASE_HOME}/conf/hbase-env.sh if your install needs to run with a larger heap. HBASE_HEAPSIZE is like HADOOP_HEAPSIZE in that its value is the desired heap size in MB. The surrounding '-Xmx' and 'm' needed to make up the maximum heap size java option are added by the hbase start script (See how HBASE_HEAPSIZE is used in the ${HBASE_HOME}/bin/hbase script for clarification).
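As a rough illustration of the wrapping described above, here is a minimal sketch of how a heap size in MB becomes a JVM option. The class and method names are made up for this example; the real logic lives in the ${HBASE_HOME}/bin/hbase shell script, which this Java snippet only imitates.

```java
// Illustrative only: imitates the '-Xmx' + value + 'm' wrapping that the
// text says the hbase start script performs. HeapSizeOption and toJvmFlag
// are hypothetical names, not part of HBase.
class HeapSizeOption {

    /** Builds the JVM maximum-heap flag from a heap size given in MB. */
    static String toJvmFlag(int heapSizeMb) {
        if (heapSizeMb <= 0) {
            throw new IllegalArgumentException("heap size must be positive, got " + heapSizeMb);
        }
        return "-Xmx" + heapSizeMb + "m";
    }

    public static void main(String[] args) {
        // A value of 1000 (MB) is roughly the 1G default mentioned above.
        System.out.println(toJvmFlag(1000)); // -Xmx1000m
        // A larger heap for a loaded install.
        System.out.println(toJvmFlag(4000)); // -Xmx4000m
    }
}
```

So setting HBASE_HEAPSIZE to 4000 would be expected to yield a 4000 MB maximum heap; check the start script itself if in doubt.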
Otherwise, particularly if your cells are small, raising the default hbase.io.index.interval configuration (or setting io.map.index.skip) has the greatest effect on the amount of heap used; see hbase-default.xml for descriptions. You can also try lowering hbase.regionserver.globalMemcache.upperLimit and hbase.regionserver.globalMemcache.lowerLimit.
5. How do I enable hbase DEBUG-level logging?
Either add the following line to your log4j.properties file -- log4j.logger.org.apache.hadoop.hbase=DEBUG -- and restart your cluster or, if running a post-0.15.x version, set DEBUG via the UI by clicking on the 'Log Level' link (there you need to set 'org.apache.hadoop.hbase' to DEBUG without the 'log4j.logger' prefix).
6. Why do I see "java.io.IOException...(Too many open files)" in my logs?
Currently Hbase is a file handle glutton. When running an Hbase loaded with more than a few regions, it's possible to blow past the common 1024 default file handle limit for the user running the process. Running out of file handles is like an OOME: things start to fail in strange ways. To up the user's file handles, edit /etc/security/limits.conf on all nodes and restart your cluster.
# Each line describes a limit for a user in the form:
#
# domain type item value
#
hbase - nofile 32768
You may need to also edit sysctl.conf.
The math runs roughly as follows: Per column family, there is at least one mapfile and possibly up to 5 or 6 if a region is under load (let's say 3 per column family on average). Multiply by the number of regions per region server. So, for example, say you have a schema of 3 column families per region and that you have 100 regions per regionserver; the JVM will open 3 * 3 * 100 mapfiles -- 900 file descriptors not counting open jar files, conf files, etc. (run 'lsof -p REGIONSERVER_PID' to see for sure).
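The arithmetic above can be sketched as a small calculation. This is only a back-of-the-envelope estimate; the class name, method name and the 1024 limit constant are assumptions for illustration, and the estimate ignores jar files, conf files and sockets.

```java
// Back-of-the-envelope estimate of open mapfiles on one regionserver.
// FileHandleEstimate is a hypothetical helper, not part of HBase.
class FileHandleEstimate {

    /** families per region * average mapfiles per family * regions per server. */
    static long estimateMapFiles(int familiesPerRegion, int avgMapFilesPerFamily, int regionsPerServer) {
        return (long) familiesPerRegion * avgMapFilesPerFamily * regionsPerServer;
    }

    public static void main(String[] args) {
        long commonDefaultLimit = 1024; // the common default limit mentioned in the text
        long mapFiles = estimateMapFiles(3, 3, 100);
        System.out.println("Estimated open mapfiles: " + mapFiles); // Estimated open mapfiles: 900
        System.out.println("Headroom under default limit: " + (commonDefaultLimit - mapFiles));
    }
}
```

With only 124 descriptors of headroom in this example, a few more regions or a region under load would push the process over the default limit.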
Or you may be running into kernel limits?
7. What can I do to improve hbase performance?
See Performance Tuning on the wiki home page
8. How do I access Hbase from my Ruby/Python/Perl/PHP/etc. application?
9. How do I create a table with a column family named "count" (or some other HQL reserved word)?
Obsolete. No longer applicable.
To delete an explicit cell, add a delete record with exactly the same timestamp (use the commit that takes a timestamp when committing the BatchUpdate that contains your delete). Entering a delete record that is newer than the cell you want to delete also works when scanning and getting with timestamps equal to or newer than the delete entry, but nothing stops you from going behind the delete entry by specifying an older timestamp and retrieving old entries.
If you want to delete all cell entries regardless of when they were written, use the HTable.deleteAll method. It finds all cells and, for each, enters a delete record with a matching timestamp.
There is nothing to stop you adding deletes or puts with timestamps that are from the far future or of the distant past, but doing so is likely to get you into trouble; it's a known issue that hbase currently does not do the necessary work of checking all stores to see if an old store has an entry that should override additions made recently.
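To make the timestamp behavior described above easier to picture, here is a toy in-memory model. It is not HBase code and says nothing about how stores are implemented; the class VersionedCell and its rule ("the newest record at or before the read timestamp decides") are simplifying assumptions drawn from the description in this answer.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of one cell's history: puts and delete records keyed by timestamp.
// Hypothetical class for illustration; not part of the HBase client API.
class VersionedCell {
    // timestamp -> value; a null value stands for a delete record.
    private final TreeMap<Long, String> records = new TreeMap<>();

    void put(long timestamp, String value) {
        records.put(timestamp, value);
    }

    void delete(long timestamp) {
        records.put(timestamp, null);
    }

    /** Returns the value visible at readTimestamp, or null if nothing is visible. */
    String get(long readTimestamp) {
        Map.Entry<Long, String> newest = records.floorEntry(readTimestamp);
        return newest == null ? null : newest.getValue();
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        cell.put(100L, "v1");
        cell.delete(200L); // delete record newer than the cell

        System.out.println(cell.get(250L)); // null: read at or after the delete
        System.out.println(cell.get(150L)); // v1: read "behind" the delete
    }
}
```

In this model a read at timestamp 150 still sees the old value, which is the "going behind the delete" case the text warns about.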
11. What ports does HBase use?
Not counting the ports used by hadoop -- hdfs and mapreduce -- by default, hbase runs the master and its informational http server at 60000 and 60010 respectively and regionservers at 60020 and their informational http server at 60030. ${HBASE_HOME}/conf/hbase-default.xml lists the default values of all ports used. Also check ${HBASE_HOME}/conf/hbase-site.xml for site-specific overrides.
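If you want a quick check that something is listening on these ports, a plain socket probe is enough. The sketch below uses only the JDK; the class name, the labels and the choice of localhost are assumptions for illustration, and the port numbers are the defaults listed above (your hbase-site.xml may override them).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

// Probes the default hbase ports named in the text. DefaultPorts is a
// hypothetical helper, not part of HBase.
class DefaultPorts {

    /** Default ports as listed in this answer, in the order given. */
    static Map<String, Integer> defaults() {
        Map<String, Integer> ports = new LinkedHashMap<>();
        ports.put("master", 60000);
        ports.put("master info http", 60010);
        ports.put("regionserver", 60020);
        ports.put("regionserver info http", 60030);
        return ports;
    }

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        for (Map.Entry<String, Integer> entry : defaults().entrySet()) {
            boolean up = isListening(host, entry.getValue(), 500);
            System.out.println(entry.getKey() + " (" + entry.getValue() + "): "
                    + (up ? "listening" : "not reachable"));
        }
    }
}
```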
12. Why is HBase ignoring HDFS client configuration such as dfs.replication?
If you have made HDFS client configuration changes on your hadoop cluster, HBase will not see them unless you do one of the following:
Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh, or
Add a copy of hadoop-site.xml to ${HBASE_HOME}/conf, or
If you have only a small set of HDFS client configurations, add them to hbase-site.xml.
The first option is the best of the three since it avoids duplication.
13. Any advice for smaller clusters in write-heavy environments?
14. Can I change the regionserver behavior so it, for example, orders keys other than lexicographically, etc.?
Yes, by subclassing HRegionServer. For an example that orders the rows returned by column values, see HBASE-605.
15. Can I safely move the master from node A to node B?
Yes. HBase must be shut down. Edit your hbase-site.xml configuration across the cluster, setting hbase.master to point at the new location.
16. Can I safely move the hbase rootdir in hdfs?
Yes. HBase must be down for the move. After the move, update the hbase-site.xml across the cluster.
17. Can HBase development be done on Windows?
See the quickstart page for Hadoop. The requirements for developing HBase on Windows are the same as for Hadoop.
18. Please explain HBase version numbering?
Originally HBase lived under src/contrib in Hadoop Core. The HBase version was that of the hosting Hadoop. The last HBase version that bundled under contrib was part of Hadoop 0.16.1 (March of 2008).
The first HBase Hadoop subproject release was versioned 0.1.0. Subsequent releases went at least as far as 0.2.1 (September 2008).
In August of 2008, consensus had it that since HBase depends on a particular Hadoop Core version, the HBase major+minor versions would from now on mirror that of the Hadoop Core version HBase depends on. The first HBase release to take on this new versioning regime was HBase 0.18.0; HBase 0.18.0 depends on Hadoop 0.18.x.
Sorry for any confusion caused.
19. What version of Hadoop do I need to run HBase?
Different versions of HBase require different versions of Hadoop. Consult the table below to find which version of Hadoop you will need:
| HBase Release Number | Hadoop Release Number |
| 0.1.x | 0.16.x |
| 0.2.x | 0.17.x |
| 0.18.x | 0.18.x |
| 0.19.x | 0.19.x |
Releases of Hadoop can be found here. We recommend using the most recent version of Hadoop possible, as it will contain the most bug fixes.
Note that HBase-0.2.x can be made to work on Hadoop-0.18.x. HBase-0.2.x ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with the jars from Hadoop-0.18.x.
Also note that after HBase-0.2.x, the HBase release numbering scheme changes to align with the Hadoop release number on which it depends.
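For scripting a compatibility check, the table above can be expressed as a simple lookup. This is a sketch covering only the four rows listed in this answer; the class and method names are hypothetical, and any release not in the table is rejected rather than guessed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Lookup of the HBase-to-Hadoop table from this FAQ entry.
// HadoopCompatibility is a hypothetical helper, not part of HBase.
class HadoopCompatibility {
    private static final Map<String, String> REQUIRED = buildTable();

    private static Map<String, String> buildTable() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("0.1", "0.16.x");
        table.put("0.2", "0.17.x");
        table.put("0.18", "0.18.x");
        table.put("0.19", "0.19.x");
        return table;
    }

    /** Maps an HBase version such as "0.18.0" to the Hadoop line it needs. */
    static String requiredHadoop(String hbaseVersion) {
        String[] parts = hbaseVersion.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected major.minor[.patch], got " + hbaseVersion);
        }
        String majorMinor = parts[0] + "." + parts[1];
        String hadoop = REQUIRED.get(majorMinor);
        if (hadoop == null) {
            throw new IllegalArgumentException("no table entry for HBase " + hbaseVersion);
        }
        return hadoop;
    }

    public static void main(String[] args) {
        System.out.println(requiredHadoop("0.2.1"));  // 0.17.x
        System.out.println(requiredHadoop("0.18.0")); // 0.18.x
    }
}
```

Matching on the major and minor parts (rather than on a string prefix) keeps 0.1.x and 0.19.x from being confused.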
20. Any other troubleshooting pointers for me?
Please see our Troubleshooting page.
21. Are there any Schema Design examples?
The following text is taken from Jonathan Gray's mailing list posts.
- There's a very big difference between storage of relational/row-oriented databases and column-oriented databases. For example, if I have a table of 'users' and I need to store friendships between these users... In a relational database my design is something like:
Table: users(pkey = userid)
Table: friendships(userid,friendid,...) which contains one (or maybe two depending on how it's implemented) row for each friendship.
In order to look up a given user's friends: SELECT * FROM friendships WHERE userid = 'myid';
The cost of this relational query continues to increase as a user adds more friends. You also begin to have practical limits. If I have millions of users, each with many thousands of potential friends, the size of these indexes grows exponentially and things get nasty quickly. Rather than friendships, imagine I'm storing activity logs of actions taken by users.
In a column-oriented database these things scale continuously with minimal difference between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships.
Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. As column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in the shipping of this information across the network which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendship table would grow massively and the indexes would be out of control.
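The layout described above, a friendships column family inside the users table, can be pictured with an in-memory stand-in. The sketch below is not the HBase client API; UsersTable, its methods and the sample values are all made up for illustration. It only shows that reading a user's friends is a lookup within one row.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a table whose rows hold "family:qualifier" columns.
// Hypothetical class for illustration; not HBase code.
class UsersTable {
    // row key -> sorted "family:qualifier" -> value
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>()).put(column, value);
    }

    /** Returns qualifier -> value for every column of one family in one row. */
    Map<String, String> getFamily(String row, String family) {
        Map<String, String> result = new TreeMap<>();
        TreeMap<String, String> columns = rows.get(row);
        if (columns == null) {
            return result;
        }
        String prefix = family + ":";
        // Columns of one family sit next to each other in the sorted map.
        for (Map.Entry<String, String> e
                : columns.subMap(prefix, prefix + Character.MAX_VALUE).entrySet()) {
            result.put(e.getKey().substring(prefix.length()), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        UsersTable users = new UsersTable();
        users.put("alice", "info:name", "Alice");
        // Qualifier = friend's id; value = whatever the friendships row would have held.
        users.put("alice", "friends:bob", "since 2008");
        users.put("alice", "friends:carol", "since 2007");
        System.out.println(users.getFamily("alice", "friends"));
    }
}
```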
Q: Can you please provide an example of "good de-normalization" in HBase and how it's held consistent (in your friends example in a relational db, there would be a cascadingDelete)? As I think of the users table: if I delete a user with the userid='123', do I have to walk through the "friends" column family of all the other users to guarantee consistency?! Is de-normalization in HBase only used to avoid joins? Our webapp doesn't use joins at the moment anyway.
You lose any concept of foreign keys. You have a primary key, that's it. No secondary keys/indexes, no foreign keys.
It's the responsibility of your application to handle something like deleting a friend and cascading to the friendships. Again, typical small web apps are far simpler to write using SQL; with HBase you become responsible for some of the things that were once handled for you.
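Here is a minimal sketch of what that application-side cascade might look like, using plain Java collections in place of HBase. It assumes friendships are written in both directions, so the deleted user's own friends list says exactly which rows need cleaning; without that assumption you would have to scan every user. FriendshipStore and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Application-level cascading delete over an in-memory stand-in for the
// users table's friends family. Hypothetical class; not HBase code.
class FriendshipStore {
    // user id -> ids of that user's friends (the qualifiers of the friends family)
    private final Map<String, Set<String>> friendsByUser = new HashMap<>();

    /** Records the friendship on both users' rows (assumed symmetric). */
    void addFriendship(String a, String b) {
        friendsByUser.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        friendsByUser.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    /** Deletes the user's row, then removes the back-references from each friend. */
    void deleteUser(String userId) {
        Set<String> friends = friendsByUser.remove(userId);
        if (friends == null) {
            return;
        }
        for (String friend : friends) {
            Set<String> backReferences = friendsByUser.get(friend);
            if (backReferences != null) {
                backReferences.remove(userId);
            }
        }
    }

    Set<String> friendsOf(String userId) {
        return friendsByUser.getOrDefault(userId, Collections.emptySet());
    }

    public static void main(String[] args) {
        FriendshipStore store = new FriendshipStore();
        store.addFriendship("123", "456");
        store.addFriendship("123", "789");
        store.addFriendship("456", "789");
        store.deleteUser("123");
        System.out.println(store.friendsOf("456")); // [789]
    }
}
```

Note that nothing here is atomic across rows; the application also has to decide what to do if it fails halfway through the cascade.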
Another example of "good denormalization" would be something like storing a user's "favorite pages". Say we want to query this data in two ways: for a given user, all of his favorites; or, for a given favorite, all of the users who have it as a favorite. A relational database would probably have tables for users, favorites, and userfavorites. Each link would be stored in one row in the userfavorites table. We would have indexes on both 'userid' and 'favoriteid' and could thus query it in both ways described above. In HBase we'd probably put a column in both the users table and the favorites table; there would be no link table.
That would be a very efficient query in both architectures, with relational performing much better with small datasets but less so with a large dataset.
Now consider asking for the favorites of these 10 users. That starts to get tricky in HBase and will undoubtedly suffer worse from random reading. The flexibility of SQL allows us to just ask the database for the answer to that question. In a small dataset it will come up with a decent solution, and return the results to you in a matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the number of users you're asking for a couple thousand. The query planner will come up with something but things will fall down and it will end up taking forever. The worst problem will be in the index bloat. Insertions to this link table will start to take a very long time. HBase will perform virtually the same as it did on the small table, if not better because of superior region distribution.
Q:[Michael Dagaev] How would you design an Hbase table for many-to-many association between two entities, for example Student and Course?
I would define two tables:
Student:
  student id
  student data (name, address, ...)
  courses (use course ids as column qualifiers here)
Course:
  course id
  course data (name, syllabus, ...)
  students (use student ids as column qualifiers here)
Does it make sense?
A[Jonathan Gray] : Your design does make sense.
As you said, you'd probably have two column-families in each of the Student and Course tables. One for the data, another with a column per student or course. For example, a student row might look like:
Student:
  id/row/key = 1001
  data:name = Student Name
  data:address = 123 ABC St
  courses:2001 = (If you need more information about this association, for example, if they are on the waiting list)
  courses:2002 = ...
This schema gives you fast access to both queries: show all classes for a student (Student table, courses family), or all students for a class (Course table, students family).
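The two-table design above boils down to writing each association twice so that either side can be read with a single row lookup. The following sketch models that with plain Java maps; it is not HBase client code, and the Enrollment class, its method names and the status strings are assumptions for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the Student table's courses family and the
// Course table's students family. Hypothetical class; not HBase code.
class Enrollment {
    // student id -> course id -> association data (Student table, courses family)
    private final Map<String, Map<String, String>> coursesByStudent = new TreeMap<>();
    // course id -> student id -> association data (Course table, students family)
    private final Map<String, Map<String, String>> studentsByCourse = new TreeMap<>();

    /** Writes the association to both tables; the application keeps them in sync. */
    void enroll(String studentId, String courseId, String status) {
        coursesByStudent.computeIfAbsent(studentId, k -> new TreeMap<>()).put(courseId, status);
        studentsByCourse.computeIfAbsent(courseId, k -> new TreeMap<>()).put(studentId, status);
    }

    /** All classes for a student: one row read from the Student side. */
    Map<String, String> coursesOf(String studentId) {
        return coursesByStudent.getOrDefault(studentId, Collections.emptyMap());
    }

    /** All students for a class: one row read from the Course side. */
    Map<String, String> studentsOf(String courseId) {
        return studentsByCourse.getOrDefault(courseId, Collections.emptyMap());
    }

    public static void main(String[] args) {
        Enrollment enrollment = new Enrollment();
        enrollment.enroll("1001", "2001", "waiting list");
        enrollment.enroll("1001", "2002", "enrolled");
        enrollment.enroll("1002", "2001", "enrolled");
        System.out.println(enrollment.coursesOf("1001"));
        System.out.println(enrollment.studentsOf("2001"));
    }
}
```

The price of the fast reads is the double write in enroll: dropping a course or a student means updating both sides from application code.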