Apache Owl Wiki
Vision
Owl provides a more natural abstraction for Map-Reduce and Map-Reduce-based technologies (e.g., Pig, SQL) by allowing developers to express large datasets as tables, which in turn consist of rows and columns. Owl tables are similar, but not identical, to familiar database and data warehouse tables.
The core M/R programming interfaces as we know them (the mapper, reducer, output collector, record reader and input format) all deal with collections of abstract data objects, not files. However, the current set of InputFormat implementations provided by the job API is relatively primitive and is heavily coupled to file formats and HDFS paths to describe input and output locations. From an application programmer’s perspective, one has to think about both the abstract data and its physical representation and storage location, which is at odds with the abstract data API. Meanwhile, file formats and (de)serialization libraries have flourished in the Hadoop community. Some of these require certain metadata in order to operate or to optimize. While they provide optimizations and performance enhancements, these file formats and SerDe libraries don’t make it any easier to develop applications on, or to manage, very large data sets.
High Level Diagram
Owl gives Hadoop users a uniform interface for organizing, discovering and managing data stored in many different formats, and promotes interoperability among different programming frameworks. Owl presents a single logical view of data organization and hides the complexity and evolution of the underlying physical data layout schemes. It gives Hadoop applications a stable foundation to build upon.
Main Properties and Features
Feature | Status |
Owl is a stand-alone table store, not tied to any particular query or processing language; it supports MR, Pig Latin, and Pig SQL | current |
Owl has a flexible data partitioning model, with multiple levels of partitioning, physical and logical partitioning, and partition pruning for query optimization | current |
Owl has a flexible interface for pushing projections and filters all the way down | current |
Owl has a framework for storing data in many storage formats, and different storage formats can co-exist within the same table | current |
Owl provides a capability discovery mechanism that allows applications to take advantage of unique features of storage drivers | current |
Owl supports both managed tables (completely managed by Owl) and unmanaged tables (called "external tables" in many databases) | external tables: current; managed tables: future |
Owl manages a unified schema for each table (by unifying the schemas of its partitions) | current |
Owl has support for storing custom metadata associated with tables and partitions | current |
Owl has support for automatic data retention management | future |
Owl has support for notifications on data change (new data added, data restated, etc.) | future |
Owl has support for converting data between write-friendly and read-friendly formats | future |
Owl has support for addressing HDFS NameNode limitations by decreasing the number of files needed to store very large data sets | future |
Owl provides a security model for secure data access | future |
Prerequisite
Owl depends on Pig for its tuple classes, which serve as its basic data container, and on Hadoop 0.20 for OwlInputFormat. Its first release will require Pig 0.7 or later and Hadoop 0.20.2 or later. Owl integrates with Zebra 0.7 out of the box.
Getting Owl
Initially, we would like to open source Owl as a Pig contrib project. In the long term, Owl could become a separate Hadoop subproject, as it provides a platform service to all Hadoop applications.
Owl would live as a Pig contrib project at:
Compilation prerequisites:
Download the following binaries:
- JDK 1.6
- Ant 1.7.1
- MySQL JDBC driver, or Oracle 11g JDBC driver
How to compile
- check out the latest Pig trunk
- compile Pig -- ant jar
- cd contrib/owl
- copy the MySQL (or Oracle) JDBC driver to the contrib/owl/java/lib directory
- build the Owl driver jar file -- ant jar -Dpig.root=../..
- build the Owl web application -- ant war -Dpig.root=../..
- run the Owl unit tests, which use Jetty and Derby and need no setup steps -- ant test -Dpig.root=../..
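The steps above can be gathered into one small shell helper. This is only a sketch: the function name build_owl and its two arguments are made up for illustration, and it assumes a Pig trunk is already checked out and that ant is on the PATH. The ant targets and the -Dpig.root value are the ones listed above.

```shell
# Hypothetical helper (not part of Owl): runs the compile steps in order.
#   $1 - root directory of a checked-out Pig trunk (assumed to exist)
#   $2 - path to the MySQL or Oracle JDBC driver jar you downloaded
build_owl() {
    pig_root="$1"
    jdbc_jar="$2"

    # Compile Pig itself first; Owl builds against it.
    ( cd "$pig_root" && ant jar ) || return 1

    # Owl expects the JDBC driver in contrib/owl/java/lib.
    cp "$jdbc_jar" "$pig_root/contrib/owl/java/lib/" || return 1

    # Build the driver jar and the web application, then run the unit tests.
    (
        cd "$pig_root/contrib/owl" || exit 1
        ant jar -Dpig.root=../.. &&
        ant war -Dpig.root=../.. &&
        ant test -Dpig.root=../..
    )
}
```

It would be called as, for example, build_owl /path/to/pig-trunk /path/to/jdbc-driver.jar, and it stops at the first step that fails.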
Deploying Owl
For a development environment, Owl supports Jetty 7.0 (with jetty-runner) and Derby 10.5. For production deployment, Owl supports:
- Tomcat 6.0
- MySQL 5.1 or Oracle 11g
After installing Tomcat and MySQL (or Oracle), you will need these files:
- owl-<0.x.x>.war - Owl web application, at contrib/owl/build
- owl-<0.x.x>.jar - Owl client library (OwlInputFormat and OwlDriver) with all of its dependent third-party libraries, at contrib/owl/build
- for MySQL:
  - mysql_schema.sql - Owl database schema file, at contrib/owl/setup/mysql
  - owlServerConfig.xml - Owl server configuration file, at contrib/owl/setup/mysql
- for Oracle:
  - oracle_schema.sql - Owl database schema file, at contrib/owl/setup/oracle
  - owlServerConfig.xml - Owl server configuration file, at contrib/owl/setup/oracle
Create the database schema in MySQL:
- create a database named "owl" in MySQL
- create the schema in it with "mysql_schema.sql"
- make sure the user specified in the JDBC connection string has full access to all objects in the newly created "owl" database
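With the stock mysql command-line client, these three steps might look like the sketch below. The function name setup_owl_db, the use of the root account, and the 'localhost' host part are assumptions for illustration; adjust them to your installation. The GRANT ... IDENTIFIED BY form is the MySQL 5.1 syntax and is not accepted by MySQL 8.0 and later.

```shell
# Hypothetical helper (not part of Owl): creates and populates the "owl" database.
#   $1 - path to mysql_schema.sql (from contrib/owl/setup/mysql)
#   $2 - database user that appears in the JDBC connection string
#   $3 - password for that user
setup_owl_db() {
    schema_file="$1"
    db_user="$2"
    db_pass="$3"

    # Each call prompts for the MySQL root password (-p).
    mysql -u root -p -e "CREATE DATABASE owl;" &&
    # Load the Owl schema into the new database.
    mysql -u root -p owl < "$schema_file" &&
    # Give the Owl user full access to every object in "owl".
    # 'localhost' assumes Tomcat runs on the same machine as MySQL.
    mysql -u root -p -e "GRANT ALL PRIVILEGES ON owl.* TO '${db_user}'@'localhost' IDENTIFIED BY '${db_pass}'; FLUSH PRIVILEGES;"
}
```

The user name and password passed here should match the ones in the JDBC connection settings in owlServerConfig.xml.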
Set up parameters in owlServerConfig:
- update jdbc driver connection information in owlServerConfig.xml
- put this file on the same machine where Tomcat is installed
Deploy Owl to Tomcat:
- deploy owl war file to Tomcat
- set the JVM system property -Dorg.apache.hadoop.owl.xmlconfig=<full path to owlServerConfig.xml> for the Tomcat deployment
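One common way to pass a JVM system property to Tomcat is the CATALINA_OPTS environment variable, set in a bin/setenv.sh file that Tomcat's startup script sources if it exists. The fragment below is a sketch under that assumption; the /etc/owl path is a placeholder, not an Owl default. The war file itself is usually deployed by copying it into Tomcat's webapps directory.

```shell
# Sketch of a setenv.sh fragment for the Tomcat installation that hosts Owl.
# OWL_CONFIG is an assumed location; point it at your real owlServerConfig.xml.
OWL_CONFIG="${OWL_CONFIG:-/etc/owl/owlServerConfig.xml}"

# Append the Owl property to whatever JVM options are already set.
CATALINA_OPTS="${CATALINA_OPTS:-} -Dorg.apache.hadoop.owl.xmlconfig=${OWL_CONFIG}"
export CATALINA_OPTS
```

Tomcat has to be restarted after this file changes, since the options are read only at startup.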
Developing on Owl
Owl has two major public APIs. Owl Driver provides management APIs for three core Owl abstractions: "Owl Table", "Owl Database", and "Partition". This API is backed by an internal Owl metadata store that runs on Tomcat and a relational database. OwlInputFormat provides a data access API and is modeled after the traditional Hadoop InputFormat. In the future, we plan to support OwlOutputFormat, and with it the notion of an "Owl Managed Table", where Owl controls the data flow into and out of Owl tables. Owl also supports Pig integration through the OwlPigLoader/Storer module.
Client API Javadoc is in owlJavaDoc.jar:
- Owl driver API - org.apache.hadoop.owl.client
- OwlInputFormat API - org.apache.hadoop.owl.mapreduce
Sample code for writing a client application against Owl is attached:
Sample code using OwlDriver API: TestOwlDriverSample.java
Next Step
We recognize that Hive has already addressed some of the above problems, and that there is significant overlap between Owl and Hive. Yet we also believe that Owl adds important new features that are necessary for managing very large tables. We look forward to collaborating with the Hive team on finding the right model for integration between the two systems and on creating a unified data management system for Hadoop.