You may already know that in Kylin 4, pre-calculated cuboid data is stored in Parquet files. Before saving this data to HDFS or object storage, Kylin repartitions it. This article introduces how the pre-calculated cuboid data is repartitioned, and how that affects query performance.
If you specify a shard by column, Kylin repartitions the data by that column (a transformation on the Spark DataFrame), and the partition number is calculated from the following configurations: `kylin.storage.columnar.shard-size-mb` and `kylin.storage.columnar.shard-rowcount`. If you do not specify a shard by column, the repartition is done by partition number only.
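As a rough illustration (a hypothetical sketch, not Kylin's actual `utils.Repartitioner` code; exact rounding may differ), the partition number can be thought of as the average of a size-based estimate and a row-count-based estimate:

```python
import math

# Hypothetical sketch of the partition-count calculation: one estimate
# from total file size, one from total row count, then the average of
# the two (the build log later in this article reports exactly these
# two intermediate numbers).
def estimate_repartition_num(total_size_mb, total_rows,
                             shard_size_mb=128, shard_rowcount=2_500_000):
    size_based = math.ceil(total_size_mb / shard_size_mb)
    row_based = math.ceil(total_rows / shard_rowcount)
    return math.ceil((size_based + row_based) / 2)

print(estimate_repartition_num(256, 5_000_000))  # 2 with default settings
```

With the default settings, 256 MB of data and 5 million rows both suggest 2 files, so the result is 2.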
For example, suppose there is a high-cardinality column called seller_id, and our queries usually filter on it. Here are some sample SQLs:
```sql
select count(*) from kylin_sales left join kylin_order where SELLER_ID = '10000233'
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235)
select count(*) from kylin_sales left join kylin_order where SELLER_ID is NULL
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235) and SELLER_ID = 10000233
select count(*) from kylin_sales left join kylin_order where SELLER_ID = 10000233 or SELLER_ID = 1
```
When the filter in a SQL query contains these operators on the shard by column, the query engine can prune all Parquet files that do not contain the specified values. So we suggest setting a column as the shard by column when it both has high cardinality and is frequently used in the where clause.
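To see why pruning works, here is a minimal sketch (hypothetical; Kylin and Spark use their own hash function, not CRC32) of how an equality or IN filter on the shard by column maps to a subset of files:

```python
import zlib

# Hypothetical shard assignment: in a hash-partitioned layout, every row
# with the same shard-by value lands in the same parquet file.
def shard_of(value, num_shards):
    return zlib.crc32(str(value).encode()) % num_shards

# For a filter like "SELLER_ID in (...)", only the files holding the
# matching shards need to be scanned; every other file is pruned
# without being opened.
def files_to_scan(filter_values, num_shards):
    return sorted({shard_of(v, num_shards) for v in filter_values})

print(files_to_scan([10000233, 10000234, 10000235], 7))
```

An IN filter with three values touches at most three of the seven files, no matter how large the segment is; a filter on any other column cannot be pruned this way and must scan all files.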
Edit the model and add the dimension seller_id. Note that the dimension type should be normal, not derived.
From Cube Designer → Advanced Setting → Rowkeys, find the column seller_id and set its Shard By to true. Note that only one shard by column is currently supported, so at most one column should have Shard By set to true.
For each cuboid, if you want to control the number of Parquet files (or their size), the following configurations are supported at cube level.
Key | Default value | Description |
---|---|---|
kylin.storage.columnar.shard-size-mb | 128 | The maximum size of each Parquet file, in MB. |
kylin.storage.columnar.shard-rowcount | 2500000 | Each Parquet file should contain at most 2.5 million rows. |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | Since Bitmap values are larger, you can specify a smaller maximum row count for cuboids that contain a Bitmap (count distinct) measure. By default such a file contains at most 1 million rows. |
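For example, to produce smaller files for a particular cube, these keys can be overridden at cube level (Cube Designer → Configuration Overwrites). The values below are illustrative only:

```properties
kylin.storage.columnar.shard-size-mb=15
kylin.storage.columnar.shard-rowcount=100000
```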
Considering file size: the total size is 147.7 MB, so the size-based file count is 147.7 MB / 15 MB = 11.
Considering row count: the total row count is 300000, so the row-count-based file count is 300000 / 100000 = 3.
So the final file count is (11 + 3) / 2 = 7.
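The final averaging step is plain arithmetic and can be checked directly (the two input numbers are the estimates reported in the build log below):

```python
import math

size_based = 11  # estimate from total file size
row_based = 3    # estimate from total row count

# Kylin's final partition count is the average of the two estimates.
final = math.ceil((size_based + row_based) / 2)
print(final)  # 7
```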
```
2020-09-02 14:25:22,320 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:25:22,320 INFO [thread-build-cuboid-2047] job.CubeBuildJob:344 : Cuboids are saved to temp path : hdfs://cdh-master:8020 _temp
2020-09-02 14:25:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,767 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,767 DEBUG [thread-build-cuboid-2047] util.HadoopUtil:302 : Use provider:org.apache.kylin.common.storage.DefaultStorageProvider
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:121 : File length repartition num : 11, Row count Rpartition num: 3
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:124 : Repartition size is :7
```
```
[root@cdh-worker-2 20200902]# hadoop fs -du -h /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047
21.9 M  21.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00000-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.9 M  20.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00001-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.5 M  21.5 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00002-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.7 M  21.7 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00003-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.4 M  21.4 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00004-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.0 M  20.0 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00005-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.2 M  20.2 M  /LACUS/LACUS/UserActionPrj/parquet/xx/20200104000000_20200105000000_QVV/2047/part-00006-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
```
You can check "Total scan files" in kylin.log.
```
2020-09-02 14:49:12,366 INFO [Query 414c15ae-ac04-2ac1-918e-3fe8c97a86bd-99] service.QueryService:387 :
==========================[QUERY]===============================
Query Id: 414c15ae-ac04-2ac1-918e-3fe8c97a86bd
SQL: xxx
...
Total scan count: 1409511
Total scan files: 4
Total metadata time: 0ms
Total spark scan time: 386ms
Total scan bytes: 67481914
Result row count: 500
...
==========================[QUERY]===============================
```