Pig OLAP Support

Background

Online analytical data processing is a powerful method for business data analysis. However, the preprocessing of full OLAP cubes involves multiplications and aggregations of large volume of data, thus being computationally expensive in nature. By allowing the process to be executed on grid data processing framework such as MapReduce, this data analysis technique can be more accessible for most business users.

Goals

The projection operator will be provided just for experimenting with data at this stage. Pig currently doesn't have indexing infrastructure support so it can be too slow in practice. My primary aim is to use Pig for cube processing then export to external system/frontend for browsing.

Implementation

Cube generation algorithm will be based on paper from Microsoft Research http://paul.rutgers.edu/~aminabdu/cs541/cube_op.pdf but converted to MapReduce version. We only work on flattened star, snowflake schema as generally do.

Map Phase

Map:( k1 , v1 ) -> list( k2 , v2 )

Map Input: ((x1,x2,x3, ..., xn), (measure1, measure2))

Map Output: 

Cn,1
((MAGICKEY,x2,x3, ..., xn), (measure1, measure2)1)
((x1,MAGICKEY,x3, ..., xn), (measure1, measure2)1)
...
((x1, x2, x3, ..., MAGICKEY), (measure1, measure2)1)


Cn,2
((MAGICKEY,MAGICKEY,x3, ..., xn-1, xn), (measure1, measure2)1)
((MAGICKEY, x2, MAGICKEY, ..., xn-1, xn), (measure1, measure2)1)
...
((x1, x2, x3, ...,MAGICKEY, MAGICKEY), (measure1, measure2)1)

...

Cn,n-1

There will be 2^n-1 output records from an input

where n = number of dimensions

Reduce Phase

Reduce: ( k2 , list ( v2 )) -> list( v2 )

Basically just apply aggregate function(s) on all the measures. The logical view of output (and obviously also a point in hypercube) will look like:-

aggregate1

aggregate2

aggregate3

measure1

output1

output2

output3

measure2

output4

output5

output6

measure3

output7

output8

output9

Language Design

We will allow the preprocessing of full OLAP cube. This will result in a new type called "Cube"

Analytical cube creation syntax

Name = CUBE Name
       Dimension (ColumnName (,ColumnName)* )
       MEASURE (ColumnName (,ColumnName)* )
       AGGREGATE (FuncName (,FuncName)* )
       [ MAGICKEY (Number) ]?

Neglecting "MAGICKEY" clause will result in automatically picked MAGICKEY (will be hardcoded)

Sample:-

Y = CUBE X
    DIMENSION (D1, D2, D3)
    MEASURE   (M1,M2)
    AGGREGATE (SUM, AVG, UDF1)
    MAGICKEY (2.718281828459045) ;

Cube Browsing Syntax

Only 1D and 2D can be displayed and understood easily by normal users so we are only interested in 1D projection, and 2D pivoting of hypercubes.

Normal Projection

We model 1D cube browsing as a process of traversing a dimension/dimension-member tree

For example,

D1 = {Sydney, Melbourne, Brisbane}

D2 = { Pi, Jon, Lap }

D3 = { Jan, Feb }

             Root
           /   |  \
         D1   D2   D3       *
       /  | \
     Syd Mel Bris
     / \
    D2  D3                  *
   / | \ 
 Pi Jon Lap
     |
     D1                     *
    /  \
   Jan Feb

Note: Only levels with "*" can be viewed

/D1 would give:-

Syd

1000

Mel

1200

Bris

1500

/D1/Syd/D2 would give:- (Under D1=Syd)

Pi

100

Jon

200

Lap

300

So, the syntax might look like:-

Z = CUBEVIEW X AT /D1/Syd/D2 ;

where X is a cube

This command will result in a bag containing the aggregated data

Pivoting

Basically we just want to be able to define two dimensions for pivoting. Dimensions appeared up in the dimension tree are not available in the lower level.

Z = CUBEVIEW X AT /D1/Syd/(D2,D3) ;

Output will look like this (in Bag):-

(Under D1=Syd)

Jan

Feb

Pi

50

50

Jon

150

50

Lap

100

200

Limitations

Initially, there will be no support for:-