You may already know that in Kylin 4, pre-calculated cuboid data is stored in Parquet files. Before saving this data to HDFS or object storage, Kylin repartitions it. This article introduces how the pre-calculated cuboid data is repartitioned, and how that affects query performance.
If you specify a shard by column, Kylin repartitions the data by that column (a transformation on the Spark DataFrame), and the partition number is calculated from the following configurations: `kylin.storage.columnar.shard-size-mb` and `kylin.storage.columnar.shard-rowcount`. If you do not specify a shard by column, the repartition is done by partition number only.
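As a rough illustration (a hypothetical sketch, not Kylin's actual `utils.Repartitioner` code; exact rounding may differ), the partition number can be thought of as the average of a size-based estimate and a row-count-based estimate:

```python
import math

# Hypothetical sketch of the partition-count calculation: one estimate
# from total file size, one from total row count, then the average of
# the two (the build log later in this article reports exactly these
# two intermediate numbers).
def estimate_repartition_num(total_size_mb, total_rows,
                             shard_size_mb=128, shard_rowcount=2_500_000):
    size_based = math.ceil(total_size_mb / shard_size_mb)
    row_based = math.ceil(total_rows / shard_rowcount)
    return math.ceil((size_based + row_based) / 2)

print(estimate_repartition_num(256, 5_000_000))  # 2 with default settings
```

With the default settings, 256 MB of data and 5 million rows both suggest 2 files, so the result is 2.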
For example, suppose there is a high-cardinality column called seller_id, and our queries usually filter on it. Here are some sample SQLs:
```sql
select count(*) from kylin_sales left join kylin_order where SELLER_ID = '10000233'
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235)
select count(*) from kylin_sales left join kylin_order where SELLER_ID is NULL
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235) and SELLER_ID = 10000233
select count(*) from kylin_sales left join kylin_order where SELLER_ID = 10000233 or SELLER_ID = 1
```
When the filter in a SQL query contains these operators on the shard by column, the query engine can prune all Parquet files that do not contain the specified values. So we suggest setting a column as the shard by column when it both has high cardinality and is frequently used in the where clause.
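To see why pruning works, here is a minimal sketch (hypothetical; Kylin and Spark use their own hash function, not CRC32) of how an equality or IN filter on the shard by column maps to a subset of files:

```python
import zlib

# Hypothetical shard assignment: in a hash-partitioned layout, every row
# with the same shard-by value lands in the same parquet file.
def shard_of(value, num_shards):
    return zlib.crc32(str(value).encode()) % num_shards

# For a filter like "SELLER_ID in (...)", only the files holding the
# matching shards need to be scanned; every other file is pruned
# without being opened.
def files_to_scan(filter_values, num_shards):
    return sorted({shard_of(v, num_shards) for v in filter_values})

print(files_to_scan([10000233, 10000234, 10000235], 7))
```

An IN filter with three values touches at most three of the seven files, no matter how large the segment is; a filter on any other column cannot be pruned this way and must scan all files.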
Edit the model and add the dimension seller_id. Note that the dimension type should be normal, not derived.
From Cube Designer → Advanced Setting → Rowkeys, find the column seller_id and set its Shard By to true. Note that only one shard by column is currently supported, so at most one column should have Shard By set to true.
For each cuboid, if you want to control the number of Parquet files (or their size), the following configurations are supported at cube level.
Key | Default value | Description |
---|---|---|
kylin.storage.columnar.shard-size-mb | 128 | The maximum size of each Parquet file, in MB. |
kylin.storage.columnar.shard-rowcount | 2500000 | Each Parquet file should contain at most 2.5 million rows. |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | Since Bitmap values are larger, you can specify a smaller maximum row count for cuboids that contain a Bitmap (count distinct) measure. By default such a file contains at most 1 million rows. |
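For example, to produce smaller files for a particular cube, these keys can be overridden at cube level (Cube Designer → Configuration Overwrites). The values below are illustrative only:

```properties
kylin.storage.columnar.shard-size-mb=15
kylin.storage.columnar.shard-rowcount=100000
```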
Considering file size: the total size is 147.7 MB, so the size-based file count is 147.7 MB / 15 MB = 11.
Considering row count: the total row count is 300000, so the row-count-based file count is 300000 / 100000 = 3.
So the final file count is (11 + 3) / 2 = 7.
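The final averaging step is plain arithmetic and can be checked directly (the two input numbers are the estimates reported in the build log below):

```python
import math

size_based = 11  # estimate from total file size
row_based = 3    # estimate from total row count

# Kylin's final partition count is the average of the two estimates.
final = math.ceil((size_based + row_based) / 2)
print(final)  # 7
```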
```
2020-09-02 14:25:22,320 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:25:22,320 INFO [thread-build-cuboid-2047] job.CubeBuildJob:344 : Cuboids are saved to temp path : hdfs://cdh-master:8020 _temp
2020-09-02 14:25:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,767 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,767 DEBUG [thread-build-cuboid-2047] util.HadoopUtil:302 : Use provider:org.apache.kylin.common.storage.DefaultStorageProvider
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:121 : File length repartition num : 11, Row count Rpartition num: 3
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:124 : Repartition size is :7
```
```
[root@cdh-worker-2 20200902]# hadoop fs -du -h /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047
21.9 M  21.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00000-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.9 M  20.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00001-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.5 M  21.5 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00002-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.7 M  21.7 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00003-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.4 M  21.4 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00004-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.0 M  20.0 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00005-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.2 M  20.2 M  /LACUS/LACUS/UserActionPrj/parquet/xx/20200104000000_20200105000000_QVV/2047/part-00006-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
```
You can check "Total scan files" in kylin.log.
```
2020-09-02 14:49:12,366 INFO [Query 414c15ae-ac04-2ac1-918e-3fe8c97a86bd-99] service.QueryService:387 :
==========================[QUERY]===============================
Query Id: 414c15ae-ac04-2ac1-918e-3fe8c97a86bd
SQL: xxx
...
Total scan count: 1409511
Total scan files: 4
Total metadata time: 0ms
Total spark scan time: 386ms
Total scan bytes: 67481914
Result row count: 500
...
==========================[QUERY]===============================
```