Apache Kylin : Analytical Data Warehouse for Big Data


Preparation

To let readers see, simply and directly, the performance difference between Kylin 4 on S3 and Kylin 4 on S3 with Soft Affinity and Local Cache, I ran a performance benchmark in a standard software and hardware environment. Because I am familiar with AWS products, I chose AWS (EC2 and S3) as the benchmark platform.

I chose TPC-H (https://github.com/Kyligence/kylin-tpch) as the benchmark standard. The scale factor used in this test is 100, meaning the fact table has 600 million rows.
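The row count follows from the scale factor by simple arithmetic. The sketch below assumes the fact table is TPC-H's lineitem table and uses the approximate figure of 6 million lineitem rows per unit of scale factor; the constant and the function name are illustrative and not part of the benchmark tooling.

```python
# Assumption: lineitem holds roughly 6 million rows per unit of scale factor.
# The exact count produced by the TPC-H data generator differs slightly.
LINEITEM_ROWS_PER_SF = 6_000_000


def approx_lineitem_rows(scale_factor):
    """Return the approximate number of fact-table rows for a scale factor."""
    return LINEITEM_ROWS_PER_SF * scale_factor


print(approx_lineitem_rows(100))
```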


The following table lists the aspects compared between the two setups in this benchmark report.

Metric/Aspect | Description
Cubing Duration | Duration of the pre-calculation (cube building) process, which loads the source tables into Kylin.
Cube Size | Disk space occupied by the cube/index.
Response Time | Average response time over a serial query test lasting fifteen minutes.
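The response-time measurement can be pictured as a simple loop. The following is a minimal sketch and not the harness used for this report: `serial_benchmark`, `run_query`, and the `clock` parameter are hypothetical names, and it assumes the average is taken per query over all runs completed within the time window.

```python
import time
from collections import defaultdict
from statistics import mean


def serial_benchmark(run_query, queries, duration_s=15 * 60, clock=time.perf_counter):
    """Run queries one at a time until duration_s has elapsed.

    run_query: callable taking a SQL string (hypothetical; e.g. a client call).
    queries:   dict mapping a query name to its SQL text.
    Returns a dict mapping each query name to its mean latency in milliseconds.
    """
    samples = defaultdict(list)
    deadline = clock() + duration_s
    while clock() < deadline:
        for name, sql in queries.items():
            start = clock()
            run_query(sql)  # serial: the next query starts only after this returns
            samples[name].append((clock() - start) * 1000.0)
            if clock() >= deadline:
                break
    return {name: mean(times) for name, times in samples.items()}
```

Passing the clock as a parameter keeps the loop easy to check with a fake timer; in a real run the default `time.perf_counter` would be used and `run_query` would submit the SQL to the query engine.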


The following tables describe the software and hardware used in this performance benchmark.

The test used EC2 nodes in three roles.

  • Distribution Node (with ZooKeeper and MySQL installed):
Item | Value
Instance Type | m5.xlarge
Node Memory | 16 GB
Node vCPU | 4
Node Disk | 30 GB (gp2)
Node Count | 1
Network Bandwidth | 5 Gbps
ZooKeeper Version | 3.4.3
MySQL Version | 5.7
  • Master Node (with Kylin 4, Spark Master, and Hive Metastore installed):
Item | Value
Instance Type | m5.4xlarge
Node Memory | 64 GB
Node vCPU | 16
Node Disk | 100 GB (gp2)
Node Count | 1
Network Bandwidth | 5 Gbps
Kylin Version | 4.0.0
Spark Version | 3.1.1 (on Hadoop 3.2)
Hive Version | 2.3.9
  • Slave Node (with only the Spark worker installed):
Item | Value
Instance Type | m5.4xlarge
Node vCPU | 16
Node Disk | 400 GB * 2 (SSD)
Node Count | 4
Network Bandwidth | 5 Gbps
Spark Version | 3.1.1 (on Hadoop 3.2)

Benchmark Results

Figure-1 : Cubing duration of TPC-H (sf = 100)


Q1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11Q12Q13Q14Q15Q16Q17Q18Q19Q20Q21Q22
Cubing duration(Minutes)1.29.5543.4530.5726.7818.918.653.8561.6330.4510.7337.2524.0325.240.139.2253.074645.54497.1545

Table-1 : Cubing duration of TPC-H (sf = 100)


Figure-2 : Storage size of TPC-H (sf = 100)



Q1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11Q12Q13Q14Q15Q16Q17Q18Q19Q20Q21Q22
Cuboid Storage(GB)0.0001521.5512.740.464.030.00312.9216.8937.555.041.810.78924.050.12780.21371.96.849.115.567.8931.753.5138

Table-2 : Storage size of TPC-H (sf = 100)




Figure-3 : Avg response time of TPC-H Query (sf=100)



Q1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11Q12Q13Q14Q15Q16Q17Q18Q19Q20Q21Q22
Only S310365249.332559.892126552.22441.11531.442563.893850.332819.566749.22810.6719969.11427.789927.675965.672086.139519.63968.511847.38410536330.5
Soft + local cache610.274048.912705.825359.18285.55192.73341.551587.822433.643519.185666.55464.1822198.73216.097975.914673.61131.88692.5533.67135.130987.94167.8

Table-3: Avg response time of TPC-H Query (sf=100)


Conclusions

Query performance.

In big-query scenarios on TPC-H at scale factor 100 (queries that scan a large number of partitions/files and run complex on-the-fly calculations over them), the response time of Kylin 4 on S3 with Soft Affinity and Local Cache is significantly lower than that of Kylin 4 on S3 only.

Thanks to Soft Affinity and Local Cache, Kylin 4 query performance improves for most queries.

Two queries, Q4 and Q13, performed worse with Soft Affinity and Local Cache turned on than with S3 alone as storage. This may be because the data was not read through the cache. The root cause was not investigated in this test; we will analyze it further and improve it in subsequent optimization work.

In conclusion, Soft Affinity and Local Cache can deliver significant performance improvements for both simple and complex queries.
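A comparison like the one above can be automated by computing the relative change per query and flagging any query that became slower. The sketch below is only an illustration: the function names are hypothetical, and the sample numbers and query names (QA, QB) are made up for the example and are not taken from Table-3.

```python
def percent_change(baseline_ms, candidate_ms):
    """Return {query: percent change in response time}.

    Negative values mean the candidate setup is faster than the baseline.
    """
    return {
        query: (candidate_ms[query] - base) / base * 100.0
        for query, base in baseline_ms.items()
    }


def regressions(changes):
    """Return the sorted names of queries whose response time increased."""
    return sorted(query for query, pct in changes.items() if pct > 0)


# Illustrative values only (milliseconds); not benchmark results.
s3_only = {"QA": 1000.0, "QB": 200.0}
with_cache = {"QA": 500.0, "QB": 250.0}

changes = percent_change(s3_only, with_cache)
print(changes)
print(regressions(changes))
```

Reporting the change relative to the baseline makes queries of very different absolute cost comparable, which matters when response times range from a few hundred to tens of thousands.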
