Benchmark Report for Parquet Storage

Kylin on HBase has matured over a long period of development, but it also has limitations. Kyligence therefore introduced Kylin on Parquet. In tests on standard data sets, the build engine of Kylin on Parquet is considerably faster than that of Kylin 3.0, which still uses HBase, and its query engine performs better on complex queries.

This article uses the standard SSB and TPC-H data sets to measure the build engine and query engine performance of Kylin on Parquet and Kylin 3.0, and then compares the results so that users can understand the advantages and disadvantages of Kylin on Parquet relative to Kylin 3.0.

SSB (Star Schema Benchmark) is a benchmark specification for testing the performance of database products on a star schema, and its data set is widely used in the OLAP field.
TPC (Transaction Processing Performance Council) publishes a variety of benchmarks; here we use the TPC-H data set. TPC-H mainly tests the response time of complex queries in order to evaluate a database system's decision support capability.

Kyligence has developed SSB and TPC-H data set tools for Kylin, which include the standard SQL queries. The source code repositories are:

https://github.com/Kyligence/ssb-kylin

https://github.com/Kyligence/kylin-tpch

Environment

Hadoop cluster with 4 physical nodes
YARN queue with 400 GB of memory and 128 CPU cores

Kylin 3.0 uses the MapReduce engine. Kylin on Parquet currently supports only an internally customized version of the Spark engine. Compared with the community version, the customized version is mainly optimized for performance and is otherwise the same as community Spark.

Spark source code repository

https://github.com/Kyligence/spark/tree/2.4.1-kylin-r3

Spark binary package download

https://download-resource.s3.cn-north-1.amazonaws.com.cn/osspark/spark-2.4.1-os-kylin-r3

Performance of Build Engine

Over SSB

The following two figures compare the build time and the storage space occupied after the build. At SSB data volumes of 60 million and 90 million rows, the new build engine roughly doubles the build speed, and the resulting storage footprint is nearly halved.

It is worth noting that Kylin on Parquet stores its final build output only once, on HDFS. With Kylin on HBase, after the cuboid files are built, the files on HDFS must be converted to HFiles. Because the cuboid files are kept in preparation for segment merging, they are not cleared from HDFS by default, so the actual storage used is double. With Parquet, a single copy of the data serves both querying and segment merging. Overall, Kylin on Parquet therefore takes only about 1/3 to 1/4 of the storage of Kylin on HBase.
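The storage arithmetic above can be checked with a short sketch. The function name `relative_storage` and the two input ratios are illustrative assumptions, not measured values from this report; the only facts taken from the text are that Kylin on HBase keeps two copies of the data by default and that the Parquet output is roughly half the size of one copy.

```python
from fractions import Fraction


def relative_storage(parquet_vs_hbase_cube: Fraction, hbase_copies: int = 2) -> Fraction:
    """Return Kylin on Parquet's total footprint as a fraction of Kylin on HBase's.

    parquet_vs_hbase_cube: size of the Parquet cube data divided by the size of
        one copy of the HBase cube data (an illustrative input, not a measurement).
    hbase_copies: copies that Kylin on HBase keeps by default
        (cuboid files on HDFS plus the HFiles converted from them).
    """
    if hbase_copies < 1:
        raise ValueError("hbase_copies must be at least 1")
    return parquet_vs_hbase_cube / hbase_copies


# Parquet output about half the size of one HBase copy, HBase keeps two copies.
print(relative_storage(Fraction(1, 2)))  # 1/4

# A less favorable (assumed) ratio of two thirds gives the upper end of the range.
print(relative_storage(Fraction(2, 3)))  # 1/3
```

Under these assumptions, a per-copy size ratio between 1/2 and 2/3 reproduces the 1/4 to 1/3 range quoted above.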

[Figures: build time and storage space over SSB, Kylin on Parquet vs. Kylin on HBase]

Performance of Query Engine

On the first query, the query engine of Kylin on Parquet creates a resident process on YARN that is dedicated to processing query tasks, so the first query is slower (initialization takes about 20 seconds). The time of the first query is not counted.

In the past week, query engine compatibility problems have been further fixed. At present, most SQL queries are supported, including those that use CountDistinct, TopN, Percentile, etc.

We test query response time with the official standard SQL of the SSB data set (90 million rows) and the TPC-H data set (12 million rows). The lower the query response time, the better the query engine performance. The standard query SQL for both data sets can be found in the SSB and TPC-H data set tool repositories mentioned at the beginning of the article.
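The measurement approach described above, in which the first (initialization) query is excluded from the results, might be sketched as follows. This is a minimal illustration and not the harness used for this report: `measure_response_times`, the `run_query` callable, the number of repeats, and the use of the median are all assumptions. In practice `run_query` would send the SQL to a Kylin instance.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_response_times(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    warmup_sql: str,
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the median response time in seconds for each named query.

    run_query is a hypothetical callable that executes one SQL statement.
    """
    # The first query triggers engine initialization, so it is run once
    # up front and its time is discarded.
    run_query(warmup_sql)

    results: Dict[str, float] = {}
    for name, sql in queries.items():
        samples: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            samples.append(time.perf_counter() - start)
        results[name] = statistics.median(samples)
    return results


def stand_in_engine(sql: str) -> None:
    """Placeholder for a real query client; only simulates a little work."""
    time.sleep(0.001)


timings = measure_response_times(
    stand_in_engine,
    queries={"example_query": "SELECT 1"},  # placeholder SQL, not an SSB query
    warmup_sql="SELECT 1",
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the median of several runs is one way to reduce the effect of outliers; the report itself does not state how many times each query was run.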

Over SSB

From the figure below, we can see that on the SSB data set, Kylin on Parquet responds more slowly than Kylin 3.0, but most queries still return within 1 second.

Over TPC-H

Because the main purpose of TPC-H is to test the response time of complex queries in a database system, the SQL of the TPC-H data set is more complicated and more demanding. As the figure below shows, Kylin on Parquet processes complex SQL queries faster and has an obvious advantage.

Conclusion

According to the performance comparison of the build and query engines of Kylin on Parquet and Kylin 3.0, the build engine of Kylin on Parquet is greatly improved: both build time and storage space are nearly halved. In the SSB query comparison, the new query engine lags behind Kylin 3.0 on simple queries, but most of them still return within about a second. The more complex SQL used in the TPC-H test generally requires more post-calculation, and there the new query engine performs better.
At present, Kylin on Parquet is still under continuous improvement. The GitHub repository is at https://github.com/apache/kylin/tree/kylin-on-parquet-v2. Issues and pull requests are welcome.
