Apache Owl Wiki

Vision

Owl provides a more natural abstraction for Map-Reduce and Map-Reduce-based technologies (e.g., Pig, SQL) by allowing developers to express large datasets as tables, which in turn consist of rows and columns. Owl tables are similar, but not identical, to familiar database / data warehouse tables.

The core M/R programming interface as we know it (the mapper, reducer, output collector, record reader, and input format) deals with collections of abstract data objects, not files. However, the current InputFormat implementations provided by the job API are relatively primitive and heavily coupled to file formats and HDFS paths for describing input and output locations. An application programmer therefore has to think about both the abstract data and its physical representation and storage location, which is at odds with the abstract data API. Meanwhile, file formats and (de)serialization libraries have proliferated in the Hadoop community, and some of them require certain metadata to operate or optimize. While they provide optimizations and performance enhancements, these file formats and SerDe libraries don't make it any easier to develop applications on, or manage, very large data sets.

High Level Diagram

owl.jpg

As the diagram shows, Owl gives Hadoop users a uniform interface for organizing, discovering, and managing data stored in many different formats, and promotes interoperability among different programming frameworks. Owl presents a single logical view of data organization and hides the complexity and evolution of the underlying physical data layout schemes. This gives Hadoop applications a stable foundation to build upon.

Main Properties and Features

Feature | Status
Owl is a stand-alone table store, not tied to any particular data query or processing languages, supporting MR, Pig Latin, and Pig SQL | current
Owl has a flexible data partitioning model, with multiple levels of partitioning, physical and logical partitioning, and partition pruning for query optimization | current
Owl has a flexible interface for pushing projections and filters all the way down | current
Owl has a framework for storing data in many storage formats, and different storage formats can co-exist within the same table | current
Owl provides a capability discovery mechanism to allow applications to take advantage of unique features of storage drivers | current
Owl supports both managed tables (completely managed by Owl) and unmanaged tables (called "external tables" in many databases) | currently supports external tables
Owl manages a unified schema for each table (by unifying the schemas of its partitions) | current
Owl has support for storing custom metadata associated with tables and partitions | current
Owl has support for automatic data retention management | future
Owl has support for notifications on data change (new data added, data restated, etc.) | future
Owl has support for converting data between write-friendly and read-friendly formats | future
Owl has support for addressing HDFS NameNode limitations by decreasing the number of files needed to store very large data sets | future
Owl provides a security model for secure data access | future
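Partition pruning, listed among the features above, can be illustrated with a small self-contained sketch. It is not Owl code: the class `PartitionPruningSketch`, its `Partition` type, the `prune`, `keys` and `samplePartitions` helpers, and the example key names and paths are all hypothetical, and the filter is assumed to be a set of simple equality conditions on partition keys. The point is only that a filter on partition keys lets a job skip whole partitions before any data is read.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these names come from the Owl API.
public class PartitionPruningSketch {

    /** A partition: one key value per partitioning level, plus its data location. */
    static final class Partition {
        final Map<String, String> keys;
        final String location;

        Partition(Map<String, String> keys, String location) {
            this.keys = keys;
            this.location = location;
        }
    }

    /** Keeps only the partitions whose keys satisfy every equality condition in the filter. */
    static List<Partition> prune(List<Partition> partitions, Map<String, String> filter) {
        List<Partition> selected = new ArrayList<>();
        for (Partition p : partitions) {
            boolean matches = true;
            for (Map.Entry<String, String> cond : filter.entrySet()) {
                if (!cond.getValue().equals(p.keys.get(cond.getKey()))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                selected.add(p);
            }
        }
        return selected;
    }

    /** Builds an ordered key map from alternating name/value arguments. */
    static Map<String, String> keys(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    /** A made-up table partitioned on two levels: date, then region. */
    static List<Partition> samplePartitions() {
        List<Partition> all = new ArrayList<>();
        all.add(new Partition(keys("date", "2010-04-01", "region", "us"), "/data/clicks/2010-04-01/us"));
        all.add(new Partition(keys("date", "2010-04-01", "region", "eu"), "/data/clicks/2010-04-01/eu"));
        all.add(new Partition(keys("date", "2010-04-02", "region", "us"), "/data/clicks/2010-04-02/us"));
        return all;
    }

    public static void main(String[] args) {
        // Only partitions for 2010-04-01 survive; the 2010-04-02 partition is never scanned.
        for (Partition p : prune(samplePartitions(), keys("date", "2010-04-01"))) {
            System.out.println(p.location);
        }
    }
}
```

With the sample data, filtering on the first level alone selects both regions for that date, while adding a condition on `region` narrows the result to a single partition.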

Prerequisite

Owl depends on Pig for its tuple classes, which serve as its basic data container, and on Hadoop 0.20 for OwlInputFormat. Its first release will require Pig 0.7 or later and Hadoop 0.20.2 or later. Owl integrates with Zebra 0.7 out-of-the-box.

Getting Owl

Initially, we would like to open source Owl as a Pig contrib project. In the long term, Owl could become a separate Hadoop subproject, as it provides a platform service to all Hadoop applications.

Owl would live as a Pig contrib project at:

Owl source code

Compilation prerequisite:

Download the following binary

How to compile

Deploying Owl

For a development environment, Owl supports Jetty 7.0 (with jetty-runner) and Derby 10.5. For production deployment, Owl supports:

After installing Tomcat and MySQL, you will need these files:

Create the database schema in MySQL:

Set up parameters in owlServerConfig:

Deploy Owl to Tomcat:

Developing on Owl

Owl has two major public APIs. Owl Driver provides management APIs for the three core Owl abstractions: "Owl Table", "Owl Database", and "Partition". This API is backed by an internal Owl metadata store that runs on Tomcat and a relational database. OwlInputFormat provides a data access API and is modeled after the traditional Hadoop InputFormat. In the future, we plan to support OwlOutputFormat and thus the notion of an "Owl Managed Table", where Owl controls the data flow into and out of "Owl Tables". Owl also supports Pig integration through the OwlPigLoader/Storer module.
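To make the relationship between the three abstractions concrete, here is a minimal in-memory sketch of a database that contains tables, which in turn contain partitions. It is not the Owl Driver API: the class `OwlModelSketch` and its nested `Database`, `Table` and `Partition` types are hypothetical, and the schema unification rule (a plain union of partition column names in first-seen order) is an assumption made for illustration, since the actual rules are not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only: these types are not part of Owl.
public class OwlModelSketch {

    /** One partition of a table: its own column list and the location of its data. */
    static final class Partition {
        final List<String> columns;
        final String location;

        Partition(List<String> columns, String location) {
            this.columns = columns;
            this.location = location;
        }
    }

    /** A table groups partitions; its schema is derived from theirs. */
    static final class Table {
        final String name;
        final List<Partition> partitions = new ArrayList<>();

        Table(String name) {
            this.name = name;
        }

        /** Assumed rule: union of all partition columns, in first-seen order. */
        List<String> unifiedSchema() {
            Set<String> columns = new LinkedHashSet<>();
            for (Partition p : partitions) {
                columns.addAll(p.columns);
            }
            return new ArrayList<>(columns);
        }
    }

    /** A database is a namespace of tables. */
    static final class Database {
        final String name;
        final Map<String, Table> tables = new LinkedHashMap<>();

        Database(String name) {
            this.name = name;
        }

        Table createTable(String tableName) {
            Table t = new Table(tableName);
            tables.put(tableName, t);
            return t;
        }
    }

    public static void main(String[] args) {
        Database db = new Database("weblogs");
        Table clicks = db.createTable("clicks");
        // The second partition was written with an extra column.
        clicks.partitions.add(new Partition(Arrays.asList("user", "url"), "/data/clicks/2010-04-01"));
        clicks.partitions.add(new Partition(Arrays.asList("user", "url", "referrer"), "/data/clicks/2010-04-02"));
        System.out.println(clicks.unifiedSchema()); // prints [user, url, referrer]
    }
}
```

In the real system this bookkeeping lives in the metadata store behind Owl Driver rather than in client memory; the sketch only shows how a table-level schema can be derived from partitions whose schemas differ.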

Client API Javadoc is at owlJavaDoc.jar

Sample code for writing a client application against Owl is attached:

Next Step

We recognize that Hive has already addressed some of the above problems, and that there is significant overlap between Owl and Hive. Yet we also believe that Owl adds important new features that are necessary for managing very large tables. We look forward to collaborating with the Hive team on finding the right model for integration between the two systems and creating a unified data management system for Hadoop.

owl (last edited 2010-04-02 05:54:07 by c-24-6-21-177)