4449
Comment:

← Revision 4 as of 20090920 23:38:11 ⇥
4449
converted to 1.6 markup

No differences found! 
Pig OLAP Support
Background
Online analytical data processing is a powerful method for business data analysis. However, the preprocessing of full OLAP cubes involves multiplications and aggregations of large volume of data, thus being computationally expensive in nature. By allowing the process to be executed on grid data processing framework such as MapReduce, this data analysis technique can be more accessible for most business users.
Goals
 We will allow population of full OLAP cubes
 We will allow tabular projection operations (1D  as Bag, 2D  as crosstab Bag)
The projection operator will be provided just for experimenting with data at this stage. Pig currently doesn't have indexing infrastructure support so it can be too slow in practice. My primary aim is to use Pig for cube processing then export to external system/frontend for browsing.
Implementation
Cube generation algorithm will be based on paper from Microsoft Research http://paul.rutgers.edu/~aminabdu/cs541/cube_op.pdf but converted to MapReduce version. We only work on flattened star, snowflake schema as generally do.
Map Phase
Map:( k1 , v1 ) > list( k2 , v2 )
Map Input: ((x1,x2,x3, ..., xn), (measure1, measure2)) Map Output: Cn,1 ((MAGICKEY,x2,x3, ..., xn), (measure1, measure2)1) ((x1,MAGICKEY,x3, ..., xn), (measure1, measure2)1) ... ((x1, x2, x3, ..., MAGICKEY), (measure1, measure2)1) Cn,2 ((MAGICKEY,MAGICKEY,x3, ..., xn1, xn), (measure1, measure2)1) ((MAGICKEY, x2, MAGICKEY, ..., xn1, xn), (measure1, measure2)1) ... ((x1, x2, x3, ...,MAGICKEY, MAGICKEY), (measure1, measure2)1) ... Cn,n1
There will be 2^n1 output records from an input
where n = number of dimensions
Reduce Phase
Reduce: ( k2 , list ( v2 )) > list( v2 )
Basically just apply aggregate function(s) on all the measures. The logical view of output (and obviously also a point in hypercube) will look like:

aggregate1 
aggregate2 
aggregate3 
measure1 
output1 
output2 
output3 
measure2 
output4 
output5 
output6 
measure3 
output7 
output8 
output9 
Language Design
We will allow the preprocessing of full OLAP cube. This will result in a new type called "Cube"
Analytical cube creation syntax
Name = CUBE Name Dimension (ColumnName (,ColumnName)* ) MEASURE (ColumnName (,ColumnName)* ) AGGREGATE (FuncName (,FuncName)* ) [ MAGICKEY (Number) ]?
Neglecting "MAGICKEY" clause will result in automatically picked MAGICKEY (will be hardcoded)
Sample:
Y = CUBE X DIMENSION (D1, D2, D3) MEASURE (M1,M2) AGGREGATE (SUM, AVG, UDF1) MAGICKEY (2.718281828459045) ;
Cube Browsing Syntax
Only 1D and 2D can be displayed and understood easily by normal users so we are only interested in 1D projection, and 2D pivoting of hypercubes.
Normal Projection
We model 1D cube browsing as a process of traversing a dimension/dimensionmember tree
For example,
D1 = {Sydney, Melbourne, Brisbane}
D2 = { Pi, Jon, Lap }
D3 = { Jan, Feb }
Root /  \ D1 D2 D3 * /  \ Syd Mel Bris / \ D2 D3 * /  \ Pi Jon Lap  D1 * / \ Jan Feb
Note: Only levels with "*" can be viewed
/D1 would give:
Syd 
1000 
Mel 
1200 
Bris 
1500 
/D1/Syd/D2 would give: (Under D1=Syd)
Pi 
100 
Jon 
200 
Lap 
300 
So, the syntax might look like:
Z = CUBEVIEW X AT /D1/Syd/D2 ;
where X is a cube
This command will result in a bag containing the aggregated data
Pivoting
Basically we just want to be able to define two dimensions for pivoting. Dimensions appeared up in the dimension tree are not available in the lower level.
Z = CUBEVIEW X AT /D1/Syd/(D2,D3) ;
Output will look like this (in Bag):
(Under D1=Syd)

Jan 
Feb 
Pi 
50 
50 
Jon 
150 
50 
Lap 
100 
200 
Limitations
Initially, there will be no support for:
 Nondiscrete dimension. This will require histogram generation support in Pig.
 Hierarchical dimension or hierarchical dimension that forms a lattice.
 No indexing support on cube browsing