Apache Kylin : Analytical Data Warehouse for Big Data

Preparation

To show the performance differences between Kylin 3 and Kylin 4 simply and directly, I ran a performance benchmark in a standard software and hardware environment. Because I am familiar with AWS products, I chose AWS EMR as the benchmark platform.

I chose TPC-H (https://github.com/Kyligence/kylin-tpch) and SSB (https://github.com/Kyligence/ssb-kylin) as the benchmark standards. The scale factor used in this test is 10 (meaning the fact table has about 60 million rows).
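As a rough illustration of how the scale factor maps to the fact table size, the sketch below assumes the usual approximation of about 6 million fact table rows per unit of scale factor. The constant and the function name are assumptions made for illustration and are not taken from the benchmark kits.

```python
# Assumption: the fact table (lineitem in TPC-H, lineorder in SSB) holds
# roughly 6 million rows per unit of scale factor.
ROWS_PER_SCALE_FACTOR = 6_000_000


def approx_fact_rows(scale_factor: int) -> int:
    """Approximate fact table row count for a given scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale factor must be positive")
    return ROWS_PER_SCALE_FACTOR * scale_factor


print(approx_fact_rows(10))  # 60000000
```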


The following table shows the aspects compared between different versions in this benchmark report.

Metric/Aspect | Description
Cubing Duration | Duration of the pre-calculation (cube building) process, i.e. loading the source table into Kylin.
Cube Size | Disk space occupied by the cube/index.
Response Time | Queries are run serially for fifteen minutes, and the 95th percentile of all response times is taken as the result.
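The Response Time metric can be sketched roughly as follows. This is a minimal illustration, not the harness used for this report: `run_query` is a hypothetical callable that submits one SQL statement and blocks until the result arrives, and the nearest-rank percentile method is an assumption, since the report does not say which percentile definition was used.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def serial_benchmark(
    run_query: Callable[[str], object],  # hypothetical: runs one query, blocks
    queries: Sequence[str],
    duration_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run the queries one at a time, cycling through the list until
    duration_s has elapsed, and return every observed latency in seconds."""
    latencies: List[float] = []
    deadline = clock() + duration_s
    i = 0
    while clock() < deadline:
        start = clock()
        run_query(queries[i % len(queries)])
        latencies.append(clock() - start)
        i += 1
    return latencies
```

For a fifteen-minute run, `percentile(serial_benchmark(run_query, queries, 15 * 60), 95)` would give the P95 response time in seconds.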


The following table shows information about software and hardware used in this performance benchmark.

Item | Value
Instance Type | m5.4xlarge
Node Memory | 64 GB
Node vCPU | 16
Node Disk | 400 * 2; SSD
Network Bandwidth | Up to 10 Gbps
Node Count | A master node and four worker nodes
Allocated Memory on YARN | 202 GB
Allocated Cores on YARN | 52
Kylin Version | 3.1.2 & 4.0.0
EMR Version | 5.31
Hadoop Version | 2.10.0
HBase Version | 1.4.13


Benchmark Results


Figure-1 : Cubing duration of TPC-H (sf = 10)


Figure-2 : Storage size of TPC-H (sf = 10)


Figure-3 : Avg response time of SSB Query (sf=10)


Figure-4 : Avg response time of TPC-H Query (sf=10)


Conclusions

Cubing duration and cube size.

Compared with Kylin 3's MR cube engine, Kylin 4 reduces the cubing duration by 62.6%, thanks to higher resource utilization and the removal of the steps that convert cuboids into a specific data format (HFile).
In Kylin 3, cuboid files are stored in two different formats, whereas Kylin 4 uses only Parquet. Because Parquet offers better encoding efficiency and a higher compression ratio, the disk space occupied by the same cube is reduced by 72.56%.
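To make the arithmetic behind these percentages explicit, here is a small sketch. The function names are illustrative, and the inputs in the usage lines are normalized placeholder values (Kylin 3 = 100), not the raw measurements from the figures.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


def shrink_factor(pct_reduction: float) -> float:
    """Express a percentage reduction as a 'times smaller' ratio."""
    if not 0 <= pct_reduction < 100:
        raise ValueError("pct_reduction must be in [0, 100)")
    return 100 / (100 - pct_reduction)


# Placeholder inputs: Kylin 3 normalized to 100 units.
print(round(reduction_pct(100, 37.4), 2))   # 62.6  (cubing duration)
print(round(reduction_pct(100, 27.44), 2))  # 72.56 (cube size)
print(round(shrink_factor(62.6), 2))        # 2.67
print(round(shrink_factor(72.56), 2))       # 3.64
```

Read this way, a 62.6% shorter build is roughly 2.7 times faster, and a 72.56% smaller cube occupies roughly 3.6 times less disk space.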

Figure-5 : Kylin 3 (MR engine) has lower resource utilization


Figure-6 : Kylin 4 (new Spark engine) has higher and more stable resource utilization

Query performance.

In big query scenarios (queries that scan and perform complex on-site calculations over a large number of partitions/files), Kylin 3 is difficult to optimize: both the HBase RegionServer and the Kylin Query Server need to be tuned repeatedly. In stress test scenarios, the query node is unstable because it has to do post-calculation on large data sets, and query latency gets worse as time goes by. Kylin 4 removes the single-point bottleneck of the Query Server, so both response time and QPS are clearly improved, and performance stays stable during the stress test. On the TPC-H query set, the response time of Kylin 4 is 5-7 times better, and its concurrency is 4 times higher.

Figure-7 : P95 response time of TPC-H Query under different concurrency
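A measurement like the one in Figure-7 could be approximated with a sketch such as the following. It is an illustration under assumptions, not the tool used for this report: `run_query` is a hypothetical blocking callable that submits one SQL statement, concurrency is modeled with a thread pool, and the P95 uses the nearest-rank method.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def timed(run_query: Callable[[str], object], sql: str) -> float:
    """Return the wall-clock latency of one query in seconds."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def p95_by_concurrency(
    run_query: Callable[[str], object],  # hypothetical query client
    queries: Sequence[str],
    levels: Sequence[int],
    rounds: int = 3,
) -> Dict[int, float]:
    """For each concurrency level, run the query set `rounds` times with
    that many worker threads and record the P95 latency."""
    results: Dict[int, float] = {}
    for workers in levels:
        batch: List[str] = list(queries) * rounds
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = list(pool.map(lambda sql: timed(run_query, sql), batch))
        results[workers] = p95(latencies)
    return results
```

Calling `p95_by_concurrency(run_query, queries, [1, 2, 4, 8])` returns one P95 value per concurrency level, which is the shape of data plotted in a chart like Figure-7.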

In the point query scenario (queries that scan a small number of partitions/files and do not need much on-site calculation), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (to be specific, only slightly worse).

Cost of learning and difficulty of performance optimization (parameter adjustment).

Compared with Kylin 4, Kylin 3 has many build steps, and different steps depend on different components, such as Hive, MapReduce and HBase. Users need to learn and understand many architectures and technical details, and become familiar with many parameters related to these components, which can be discouraging for new users.
In contrast, both cubing and query in Kylin 4 run on the popular Spark engine, so new users only need to master Spark to learn and adjust parameters. Learning materials for Spark are easy to find, and there are far fewer commonly used parameters than in Kylin 3.
