Howl, A Shared Table Management Service for Hadoop

Motivation

Data processors using Hadoop have a common need for table management services. The goal of a table management service is to track the data that exists on a Hadoop grid and present that data to users in a tabular format. Such a service needs to provide a single input and output format to users so that individual users need not be concerned with the storage formats chosen for particular data sets. As part of presenting a single format, the data needs to be described by a single schema model and a single datatype system.

Additionally, users should be free to choose the best tools for their use cases. The Hadoop project includes Map Reduce, Streaming, Pig, and Hive, and additional tools such as Cascading exist. Each of these tools has users who prefer it, and there are use cases best addressed by each. Two users on the same grid who need to share data should not be constrained to use the same tool, but rather should be free to choose the best tool for their use case. A table management service that presents data in the same way to all of the tools makes such sharing possible by providing interfaces to each of the data processing tools.

There are also a few other features a table management service should provide, such as notification of when data arrives.

So, a grid with such a table management service would look like:

howl_pic.jpg

In this picture the boxes on top represent the data processing tools, while those on the bottom represent storage formats.

Current Situation

Hive has a metadata store that tracks all of the tables and partitions known to a given instance of Hive. It also includes a Thrift service so that non-Hive clients can browse, create, and update the metadata.

hive.jpg
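
For example, a Java client outside of Hive can already list tables and inspect their locations through the metastore client API. The following is a minimal sketch; the "default" database name and the reliance on a local hive-site.xml for connection settings are assumptions:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class BrowseMetastore {
        public static void main(String[] args) throws Exception {
            // Picks up the metastore/Thrift connection settings from hive-site.xml.
            HiveConf conf = new HiveConf();
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            // List the tables in the "default" database and show where each lives.
            for (String name : client.getAllTables("default")) {
                Table table = client.getTable("default", name);
                System.out.println(name + " -> " + table.getSd().getLocation());
            }
            client.close();
        }
    }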

Yahoo! has been working on Owl, which provides metadata as well as other table management services. Like Hive, it has a metadata store and a REST server for non-Java clients. It also provides OwlInputFormat and OwlOutputFormat so that Map Reduce users can read and write data, and OwlLoader and OwlStorage so that Pig users can read and write data. Owl uses Pig's schema and datatype model.

Owl includes StorageDrivers, which are used to translate between a storage format's input and output formats and Owl's input and output formats. This means that for any new storage format, only one piece of code needs to be written to enable Pig and Map Reduce users to read and write the data.

owl.jpg
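
Conceptually, a storage driver is a small adapter between one storage format and Owl's generic formats. The sketch below is purely illustrative; the class and method names are hypothetical, not Owl's actual interface:

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.OutputFormat;

    // Hypothetical adapter between one storage format and Owl's formats.
    public abstract class StorageDriver {
        // The underlying format's InputFormat (e.g. RCFile's), to which
        // OwlInputFormat delegates reads.
        public abstract InputFormat<?, ?> getInputFormat();

        // The underlying format's OutputFormat, to which OwlOutputFormat
        // delegates writes.
        public abstract OutputFormat<?, ?> getOutputFormat();

        // Convert between the format's native record representation and the
        // tuple representation Owl hands to Pig and Map Reduce users.
        public abstract Object toTuple(Writable storageRecord);
        public abstract Writable fromTuple(Object tuple);
    }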

Owl provides table management services for Map Reduce and Pig. But because its metadata store is separate from Hive's, these services are not available to Hive users.

Howl

This project proposes to combine Hive's metadata store and Owl into a new system called Howl.

howl.jpg

Howl will use Hive's metastore and Thrift service in order to minimize integration costs with Hive. It will also adopt Hive's schema and datatype model, as Pig's schemas and datatypes are compatible with it. The OwlInputFormat, OwlOutputFormat, OwlLoader, and OwlStorage classes from Owl will be reworked as HowlInputFormat, HowlOutputFormat, HowlLoader, and HowlStorage.
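
As a rough sketch of what this could look like for a Map Reduce user, the snippet below configures a job to read a Howl table. The HowlInputFormat.setInput helper and its arguments are assumptions made for illustration, not a published API, and MyMapper stands in for the user's own mapper class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ProcessRawEvents {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "process rawevents");
            job.setJarByClass(ProcessRawEvents.class);
            // Hypothetical helper: point the job at a Howl table and a partition
            // filter, letting Howl supply the schema and choose the storage driver.
            HowlInputFormat.setInput(job, "rawevents", "date=20100716");
            job.setInputFormatClass(HowlInputFormat.class);
            job.setMapperClass(MyMapper.class);   // the user's own mapper
            job.waitForCompletion(true);
        }
    }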

The initial release of Howl will allow interoperability of data between Pig, Map Reduce, and Hive. It will also include a command line interface (CLI) so that users can create, modify, and read metadata. At least initially there will not be a HowlSerDe. Hive will continue to use Howl as a metadata store in the same way it uses its metadata store today. The initial release will also include a storage driver for RCFile.
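
For instance, creating a partitioned table through the CLI might look like the command below. The DDL shown is only a sketch that borrows Hive's syntax, and the column names are illustrative:

    howl "CREATE TABLE rawevents (url STRING, zipcode STRING, clicks INT) PARTITIONED BY (date STRING) STORED AS TEXTFILE"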

Future goals for Howl include:

Example

To get a feel for how this might work, consider the following example. Assume that on a daily basis, Joe copies data onto his company's Hadoop grid via distcp and runs a Pig Latin script against it to cleanse and transform the data. Then his manager Suzie uses that data in her SQL queries to make business decisions. With Howl in the grid, this will look something like the following.

To load the data onto the grid and register it with Howl, Joe would do:

    hadoop distcp file:///data/staging/20100716/file.dat hdfs://mynamenode.mygrid.mycompany.com/data/rawevents/20100716/data
    howl "ALTER TABLE rawevents ADD PARTITION 20100716 hdfs://mynamenode.mygrid.mycompany.com/data/rawevents/20100716/data"

To process the data with Pig, he would then run a Pig Latin script:

    A = LOAD 'rawevents' USING HowlLoader(); -- schema is known by Howl and provided to Pig, the user need not specify it.
    B = FILTER A BY date == '20100716'; -- this will get passed as a partition filter to Howl so that only the appropriate partitions of rawevents are read.
    ...
    STORE Z INTO 'processedevents' USING HowlStorage('date=20100716');

The above store in Pig Latin would automatically register the data in Howl. Suzie would then be able to run her daily query against this data:

    SELECT zipcode, COUNT(clicks)
    FROM processedevents -- table name is the same in Hive as in Pig
    WHERE date = '20100716' -- partitioning is the same too
    GROUP BY zipcode

If, at some point in the future, the raw data is stored via RCFile instead of being stored as text files and copied in via distcp, then the only work that needs to be done is altering the table to indicate that RCFile rather than text file is its default storage format. The Pig Latin script need not be changed, and old data will not need to be converted. If there is a monthly Pig Latin script that rolls up the daily raw events, Howl will handle the fact that some of the data is stored in text and some in RCFile and present a single stream to Pig for processing.
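
As a sketch, again borrowing Hive's DDL syntax (the exact commands the CLI will accept are not yet settled), that change might be a single command:

    howl "ALTER TABLE rawevents SET FILEFORMAT RCFILE"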

Join Us

Currently Howl's code is hosted at github: http://github.com/yahoo/howl

Howl issues are discussed on howldev@yahoogroups.com. You can join the list by sending mail to howldev-subscribe@yahoogroups.com.
