Authors: Yiming Xu, Mingming Ge

1. Background: Why Kylin on Parquet

Currently, Kylin uses Apache HBase as the storage for OLAP cubes.

HBase is very fast, but it also has some drawbacks:

This proposal is to use Apache Parquet + Spark to replace HBase:

2. Parquet file layouts on HDFS

3. Dimension/measure layouts in Parquet


Parquet file schema:
    1:           OPTIONAL INT64 R:0 D:1
    2:           REQUIRED DOUBLE R:0 D:0
    3:           OPTIONAL INT64 R:0 D:1
    110000:      OPTIONAL INT64 R:0 D:1
    110001:      OPTIONAL INT64 R:0 D:1
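
As a rough illustration of how to read the schema dump above, the following Python sketch parses each line into its column name, repetition, physical type, and maximum repetition (R) and definition (D) levels. The `Column` tuple and the `parse_schema` helper are hypothetical names introduced only for this example; they are not part of Kylin or of any Parquet tool. For flat columns like these, D:1 goes with OPTIONAL (the value may be null) and D:0 with REQUIRED.

```python
import re
from typing import List, NamedTuple


class Column(NamedTuple):
    """One line of the schema dump (hypothetical helper type)."""
    name: str
    repetition: str
    physical_type: str
    max_repetition_level: int
    max_definition_level: int


# Matches lines such as "110000:      OPTIONAL INT64 R:0 D:1".
_LINE = re.compile(
    r"^\s*(?P<name>[^:\s]+):\s+"
    r"(?P<rep>OPTIONAL|REQUIRED|REPEATED)\s+"
    r"(?P<type>\w+)\s+R:(?P<r>\d+)\s+D:(?P<d>\d+)\s*$"
)


def parse_schema(text: str) -> List[Column]:
    """Parse the schema text into Column tuples, skipping non-matching lines."""
    columns = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            columns.append(
                Column(m["name"], m["rep"], m["type"], int(m["r"]), int(m["d"]))
            )
    return columns


# The schema text is copied from the listing above.
SCHEMA = """\
1:           OPTIONAL INT64 R:0 D:1
2:           REQUIRED DOUBLE R:0 D:0
3:           OPTIONAL INT64 R:0 D:1
110000:      OPTIONAL INT64 R:0 D:1
110001:      OPTIONAL INT64 R:0 D:1
"""

columns = parse_schema(SCHEMA)
for col in columns:
    # A flat column is nullable exactly when its max definition level is > 0.
    nullable = col.max_definition_level > 0
    print(col.name, col.physical_type, "nullable" if nullable else "not null")
```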

4. Data types mapping in Parquet

Type            Spark           Parquet
Numeric types   ByteType        INT32
Numeric types   ShortType       INT32
Numeric types   IntegerType     INT32
Numeric types   LongType        INT64
Numeric types   FloatType       FLOAT
Numeric types   DoubleType      DOUBLE
Numeric types   DecimalType     INT32, INT64, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
String type     StringType      BYTE_ARRAY
Binary type     BinaryType      BYTE_ARRAY
Boolean type    BooleanType     BOOLEAN
Datetime type   TimestampType   INT96
Datetime type   DateType        INT32
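
The mapping above can be expressed as a simple lookup. The sketch below is only an illustration in plain Python: the `SPARK_TO_PARQUET` dictionary and the `parquet_types` function are hypothetical names, not Spark or Kylin APIs. The type names are plain strings taken from the table, and the binary-backed decimal case is written with the Parquet physical-type name BYTE_ARRAY, which is an assumption about the table's intent. DecimalType returns several candidates because the table lists more than one physical type for it; which one is actually written is not specified here.

```python
from typing import Dict, Tuple

# Spark SQL type name -> candidate Parquet physical types, taken from the table.
SPARK_TO_PARQUET: Dict[str, Tuple[str, ...]] = {
    "ByteType": ("INT32",),
    "ShortType": ("INT32",),
    "IntegerType": ("INT32",),
    "LongType": ("INT64",),
    "FloatType": ("FLOAT",),
    "DoubleType": ("DOUBLE",),
    # Assumption: the binary-backed decimal is named BYTE_ARRAY on the Parquet side.
    "DecimalType": ("INT32", "INT64", "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"),
    "StringType": ("BYTE_ARRAY",),
    "BinaryType": ("BYTE_ARRAY",),
    "BooleanType": ("BOOLEAN",),
    "TimestampType": ("INT96",),
    "DateType": ("INT32",),
}


def parquet_types(spark_type: str) -> Tuple[str, ...]:
    """Return the candidate Parquet physical types for a Spark type name.

    Raises KeyError for a type that the table does not cover.
    """
    return SPARK_TO_PARQUET[spark_type]


for name in ("LongType", "DecimalType", "TimestampType"):
    print(name, "->", ", ".join(parquet_types(name)))
```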

5. How to build Cube into Parquet

(ShaofengShi: this part needs detailed info)

6. How to query with Parquet

  

  

7. Performance

Kyligence provides dataset tools for SSB and TPC-H, which include test SQL cases. The repositories are as follows:

8. Next step