Apache Zebra Wiki
Zebra is a storage layer that provides a high-level data access abstraction and a tabular view of data in Hadoop, so that Pig users do not have to implement their own data storage and retrieval code. It provides:
- columnar storage format for fast data projection
- schema language to manage physical storage metadata
- CPU/space-efficient data serialization
In the future, it could also support predicate pushdown for further performance improvement. Zebra is initially released as a contrib project in Pig and may later become a Hadoop subproject.
Zebra requires a Hadoop 20 build that supports TFile (as of July 24th, 2009, this means Hadoop 20 with Hadoop patch 6150). It works with Pig 0.3.0 with patch PIG-660 applied; that patch makes Pig work with Hadoop 20. Zebra has been submitted as PIG-833.
Zebra has been committed as a Pig contrib project under contrib/zebra.
Build prerequisites:
- JDK 1.6
- Ant 1.7.1
- Javacc 4.2
How to compile:
- check out the latest Pig trunk
- apply the latest patch from PIG-660
- copy the hadoop20.jar attached to PIG-833 to Pig's top-level ./lib directory
- run 'ant jar' (generates a Pig binary compatible with Hadoop 20)
- run 'ant -Dtestcase=none test-core' (needed for the Zebra tests)
- cd contrib/zebra
- run 'ant jar'
- run 'ant test' (runs the Zebra tests)
The Zebra jar will be generated in the build/contrib/zebra directory.
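The compile steps above can be sketched as a small shell script. This is only an illustration under several assumptions that are not stated on this wiki: the Pig trunk is already checked out at $PIG_HOME, the PIG-660 patch and hadoop20.jar were downloaded by hand to the paths held in the hypothetical variables PIG_660_PATCH and HADOOP20_JAR (use absolute paths for a real run), and the patch applies with strip level -p0. The script is a dry run by default and only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the Zebra build steps. All variable names and default paths
# below are hypothetical placeholders, not values taken from the wiki.
PIG_HOME="${PIG_HOME:-pig-trunk}"                # assumed checkout location
PIG_660_PATCH="${PIG_660_PATCH:-PIG-660.patch}"  # assumed patch file name
HADOOP20_JAR="${HADOOP20_JAR:-hadoop20.jar}"     # jar attached to PIG-833

# Print the command in dry-run mode (the default); otherwise execute it
# and stop the script at the first failure.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1
    fi
}

build_zebra() {
    run cd "$PIG_HOME"
    run patch -p0 -i "$PIG_660_PATCH"    # strip level -p0 is an assumption
    run cp "$HADOOP20_JAR" lib/
    run ant jar                          # Pig binary compatible with Hadoop 20
    run ant -Dtestcase=none test-core    # needed for the Zebra tests
    run cd contrib/zebra
    run ant jar                          # builds the Zebra jar
    run ant test                         # runs the Zebra tests
}

build_zebra
```

Run as-is, the script lists the eight commands in order without touching the file system, which makes it easy to review the sequence before executing it with DRY_RUN=0.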
Sample MapReduce code and Pig scripts are attached to this wiki.
Javadoc is available at Zebra JavaDoc.