Developer Guide

1. Code Organization and a brief architecture

1.1. Introduction

Hive comprises 3 main components: Serializers/Deserializers (SerDe), the MetaStore, and the Query Processor.

Apart from these major components, Hive also contains a number of other components. These are as follows:

The following top-level directories contain helper libraries, packaged configuration files, etc.:

1.2. SerDe

What is SerDe

Note that the "key" part is ignored when reading and is always a constant when writing. The row object is stored only in the "value".
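The key/value behavior described above can be illustrated with a small Java sketch. It is only an illustration under stated assumptions: the class KeyValueRowSketch, its methods, the empty-string constant key, and the tab-delimited value layout are all hypothetical and are not part of Hive.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Hive classes.
class KeyValueRowSketch {
    // Constant key written with every record (the empty string is an assumption).
    static final String CONSTANT_KEY = "";

    // Writing: the whole row goes into the value; the key is always the same constant.
    static Map.Entry<String, String> write(List<String> row) {
        return new SimpleEntry<>(CONSTANT_KEY, String.join("\t", row));
    }

    // Reading: the key is never looked at; the row is rebuilt from the value alone.
    static List<String> read(Map.Entry<String, String> record) {
        return new ArrayList<>(Arrays.asList(record.getValue().split("\t", -1)));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> written = write(Arrays.asList("1", "alice"));
        // A record with a different key but the same value yields the same row.
        Map.Entry<String, String> otherKey = new SimpleEntry<>("some-other-key", written.getValue());
        System.out.println(read(written).equals(read(otherKey))); // prints "true"
    }
}
```

Because the reader depends only on the value, files produced by other tools can carry any key and still be read as the same rows.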

One principle of Hive is that Hive does not own the HDFS file format. Users should be able to read the HDFS files in Hive tables directly using other tools. They should also be able to use other tools to write HDFS files directly, which Hive can then read through "CREATE EXTERNAL TABLE" or load through "LOAD DATA INPATH", which just moves the file into the Hive table directory.

Note that org.apache.hadoop.hive.serde is the deprecated old serde library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

Hive currently uses these FileFormat classes to read and write HDFS files:

Hive currently uses these SerDe classes to serialize and deserialize data:

How to write your own SerDe:

Some important points of SerDe:

1.2.1. ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:

A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
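The pairing of an inspector with a plain Java Object can be sketched as follows. This is a simplified stand-in for the idea, not Hive's ObjectInspector API: the interface StructInspectorSketch, both implementations, and the two storage formats chosen here (an Object[] and a delimited String) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for the idea; NOT Hive's ObjectInspector API.
interface StructInspectorSketch {
    List<String> fieldNames();
    Object getField(Object row, String name);
}

// Row stored as an Object[]: fields are looked up by position.
class ArrayRowInspector implements StructInspectorSketch {
    private final List<String> names;
    ArrayRowInspector(String... names) { this.names = Arrays.asList(names); }
    public List<String> fieldNames() { return names; }
    public Object getField(Object row, String name) {
        return ((Object[]) row)[names.indexOf(name)];
    }
}

// Row stored as a delimited String: fields are parsed out on access.
class DelimitedRowInspector implements StructInspectorSketch {
    private final String delimiterRegex;
    private final List<String> names;
    DelimitedRowInspector(String delimiter, String... names) {
        this.delimiterRegex = Pattern.quote(delimiter);
        this.names = Arrays.asList(names);
    }
    public List<String> fieldNames() { return names; }
    public Object getField(Object row, String name) {
        return ((String) row).split(delimiterRegex, -1)[names.indexOf(name)];
    }
}

class InspectorDemo {
    // Code written against the inspector does not need to know how the row is stored.
    static String describe(Object row, StructInspectorSketch inspector) {
        StringBuilder sb = new StringBuilder();
        for (String field : inspector.fieldNames()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(field).append('=').append(inspector.getField(row, field));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Object arrayRow = new Object[] {1, "alice"};
        Object textRow = "1,alice";
        System.out.println(describe(arrayRow, new ArrayRowInspector("id", "name")));
        System.out.println(describe(textRow, new DelimitedRowInspector(",", "id", "name")));
    }
}
```

Both calls print id=1, name=alice: the same describe method handles two different in-memory layouts because the inspector, not the caller, knows the structure of the Object.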

1.3. MetaStore

The MetaStore contains metadata about tables, partitions, and databases. The Query Processor uses this metadata during plan generation.

1.4. Query Processor

The following are the main components of the Hive Query Processor:

1.4.1. Compiler

1.4.2. Parser

1.4.3. TypeChecking

1.4.4. Semantic Analysis

1.4.5. Plan generation

1.4.6. Task generation

1.4.7. Execution Engine

1.4.8. Plan

1.4.9. Operators

1.4.10. UDFs and UDAFs

2. Compiling Hive

Hive can be made to compile against different versions of Hadoop.

2.1. Default Mode

From the root of the source tree:

ant package

will make Hive compile against Hadoop version 0.19.0. Note that:

2.2. Advanced Mode

ant -Dtarget.dir=<my-install-dir> package

ant -Dhadoop.version=0.17.1 package

ant -Dhadoop.root=~/src/hadoop-19/build/hadoop-0.19.2-dev -Dhadoop.version=0.19.2-dev

Note that:

In this particular example, ~/src/hadoop-19 is a checkout of the hadoop 19 branch that uses 0.19.2-dev as its default version and creates a distribution directory in build/hadoop-0.19.2-dev by default.

3. Unit tests and debugging

3.1. Layout of the unit tests

Hive uses JUnit for unit tests. Each of the 3 main components of Hive has its unit test implementations in the corresponding src/test directory: trunk/metastore/src/test has all the unit tests for the metastore, trunk/serde/src/test has all the unit tests for serde, and trunk/ql/src/test has all the unit tests for the query processor. The metastore and serde unit tests provide the TestCase implementations for JUnit. The query processor tests, on the other hand, are generated using Velocity. The main directories under trunk/ql/src/test that contain these tests and the corresponding results are as follows:

3.2. Tables in the unit tests

3.3. Running unit tests

Run all tests:

ant test

Run all positive test queries:

ant test -Dtestcase=TestCliDriver

Run a specific positive test query:

ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q

The above test produces the following files:

3.4. Adding new unit tests

First, write a new myname.q in ql/src/test/queries/clientpositive.

Then, run the test with the query and overwrite the result (useful when you add a new test):

ant test -Dtestcase=TestCliDriver -Dqfile=myname.q -Doverwrite=true

Then we can create a patch by:

svn add ql/src/test/queries/clientpositive/myname.q ql/src/test/results/clientpositive/myname.q.out
svn diff > patch.txt
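The role of -Doverwrite=true in the steps above can be pictured as an expected-output ("golden file") check. The Java sketch below shows that general pattern only; it is not Hive's test driver, and the class GoldenFileSketch, its check method, and the temporary file name are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of an expected-output check; not Hive's actual test code.
class GoldenFileSketch {
    // Returns true if the test should be considered passing.
    static boolean check(Path expectedFile, String actualOutput, boolean overwrite) {
        try {
            if (overwrite) {
                // Record the current output as the new expected result.
                Files.write(expectedFile, actualOutput.getBytes(StandardCharsets.UTF_8));
                return true;
            }
            if (!Files.exists(expectedFile)) {
                return false; // nothing to compare against yet
            }
            String expected = new String(Files.readAllBytes(expectedFile), StandardCharsets.UTF_8);
            return expected.equals(actualOutput);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path out = Paths.get(System.getProperty("java.io.tmpdir"),
                "myname-" + System.nanoTime() + ".q.out");
        System.out.println(check(out, "1\talice\n", false)); // no expected file yet
        System.out.println(check(out, "1\talice\n", true));  // records the result
        System.out.println(check(out, "1\talice\n", false)); // matches the recorded result
        System.out.println(check(out, "2\tbob\n", false));   // differs from the recorded result
        out.toFile().delete();
    }
}
```

Under this pattern, overwriting is appropriate only when the new output has been reviewed and is known to be correct, which is why the recorded .q.out file is added to the patch alongside the query.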

3.5. Debugging Hive code

Hive code includes both client-side code (e.g., the compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). The client-side code runs on your local machine, so you can debug it in Eclipse the same way you debug any local Java code. The server-side code is distributed and runs on the Hadoop cluster, so debugging it is more complicated. In addition to printing to log files using log4j, you can attach the debugger to a different JVM under unit test (single machine mode). Below are the steps to debug server-side code.

4. Pluggable interfaces

4.1. File Formats

Please refer to Hive User Group Meeting August 2009, pages 59-63.

4.2. SerDe - how to add a new SerDe

Please refer to Hive User Group Meeting August 2009, pages 64-70.

4.3. Map-Reduce Scripts

Please refer to Hive User Group Meeting August 2009, pages 71-73.

4.4. UDFs and UDAFs - how to add new UDFs and UDAFs

Please refer to Hive User Group Meeting August 2009, pages 74-87.

Hive/DeveloperGuide (last edited 2009-11-06 19:00:41 by Ning Zhang)