The Apache CarbonData community is pleased to announce the release of Version 2.0.0 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!

We encourage you to try the release at https://archive.apache.org/dist/carbondata/2.0.0/ and share feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 2.0.0?

The intention of CarbonData 2.0.0 is to move closer to unified analytics. In this version, around 200 JIRA tickets related to improvements and bugs have been resolved. The following is a summary.

Cache Pre-priming in Index Server

CarbonData loads the min/max cache into memory on the first query executed against a table, which degrades the performance of that first query. To overcome this and improve first-time query performance, users can enable the pre-priming feature, which loads the min/max cache into memory on each data load instead.
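A minimal sketch of enabling it, assuming the index-server properties named in the documentation (property names should be verified for your deployment):

    // Enable the index server and cache pre-priming so the min/max cache
    // is populated at load time rather than on the first query.
    // Property names follow the index-server docs; treat them as assumptions.
    import org.apache.carbondata.core.util.CarbonProperties

    CarbonProperties.getInstance()
      .addProperty("carbon.enable.index.server", "true")
      .addProperty("carbon.indexserver.enable.prepriming", "true")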

Carbon Extensions for Spark 2.4, without CarbonSession

Due to CarbonData's tight integration with Spark, Carbon previously required a CarbonSession to be created instead of a SparkSession, which led to many usability concerns. To make the integration layer modular, CarbonData now supports the SparkSessionExtensions API, which enables Carbon to plug its parser and optimizer into an existing SparkSession.
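For example, a plain SparkSession can pick up Carbon's rules through the extensions configuration (the extension class name is taken from the 2.0 documentation):

    // A standard SparkSession with Carbon's extensions; no CarbonSession needed.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CarbonExtensionsExample")
      .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
      .getOrCreate()

    // Carbon DDL now works on the plain session.
    spark.sql("CREATE TABLE t1 (id INT, name STRING) STORED AS carbondata")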

MV Time-series Support with Rollup and Multiple Granularities

Analytics data such as application performance monitoring, network data, sensor data, events, clicks, banking transactions, server metrics, etc., has to be aggregated and analyzed or monitored over a period of time for business needs. CarbonData supports pre-computation of aggregations and joins through materialized views, which provides faster query results, and time-series support on top of materialized views is required by many customers. This release adds it, with rollup across multiple granularities.
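As an illustration, an hour-granularity rollup might be defined as follows. This is a sketch against a hypothetical sales table, using the timeseries UDF and MV DDL from the materialized-view guide; verify the exact syntax for this version.

    // 'spark' is a SparkSession with CarbonExtensions enabled (see above).
    // Hour-level pre-aggregation; coarser granularities (day, month, ...)
    // can be rolled up from it.
    spark.sql("""
      CREATE MATERIALIZED VIEW sales_hourly AS
      SELECT timeseries(order_time, 'hour') AS order_hour, SUM(price) AS total
      FROM sales
      GROUP BY timeseries(order_time, 'hour')
    """)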

Spatial Index DataMap

For queries that require a filter on a spatial object, such as a region on a 2D map, CarbonData performance used to be slow: these queries were treated as full-scan queries, causing significant performance degradation.

To overcome this limitation, CarbonData implements spatial indexing, a technique commonly used by spatial databases that allows a spatial object to be accessed efficiently.
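A sketch of the table DDL, assuming the geohash-based spatial index properties described in the spatial-index guide (the property names, and the additional grid/origin properties a real table needs, should be taken from that guide):

    // Table with a geohash spatial index derived from longitude/latitude.
    // Property names are illustrative, per the spatial-index guide; a real
    // table also needs grid-size/origin properties omitted here.
    spark.sql("""
      CREATE TABLE spatial_events (id BIGINT, latitude BIGINT, longitude BIGINT)
      STORED AS carbondata
      TBLPROPERTIES (
        'SPATIAL_INDEX'='mygeohash',
        'SPATIAL_INDEX.mygeohash.type'='geohash',
        'SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude')
    """)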

Secondary Index (SI)

When unsorted data is stored in carbon, the pruning process tends to give false positives when comparing min/max values. For example, a blocklet might contain the integer values 3, 5, and 8, which means min=3 and max=8. If the user's filter expression contains the value 4, pruning will falsely report that this blocklet may contain matching data, and the read flow will decompress the page and scan its contents. This leads to unnecessary I/O and, ultimately, a performance degradation.

To improve query performance, a Secondary Index has been designed on top of the existing min/max architecture. It is essentially a reverse index from each unique value to the blocklets in which that value is present, giving the exact location of the data so that false positives during pruning are minimized.
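Creating one uses the new CREATE INDEX syntax (table and column names here are hypothetical):

    // Secondary index mapping each distinct 'city' value to the blocklets
    // containing it, so filters on 'city' prune precisely.
    spark.sql("CREATE INDEX city_index ON TABLE sales(city) AS 'carbondata'")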

Support CDC merge functionality

In the current data warehouse world, slowly changing dimensions (SCD) and change data capture (CDC) are very common scenarios. Legacy systems such as RDBMSs handle these scenarios very well because of their support for transactions.

To keep up with the existing database technologies, CarbonData now supports CDC and SCD functionalities.
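A sketch of the Dataset-level merge flow described in the SCD/CDC guide; the merge extension, the builder methods, and their exact signatures are assumptions to verify against that guide:

    // Upsert rows from a change set into a target carbon table, keyed on id.
    // The Dataset 'merge' extension comes from the carbon-spark integration;
    // see the SCD/CDC guide for the exact import and signatures.
    import org.apache.spark.sql.functions.col

    val target  = spark.sql("SELECT * FROM customer").as("A")
    val changes = spark.sql("SELECT * FROM customer_updates").as("B")

    target.merge(changes, col("A.id") === col("B.id"))
      .whenMatched()
      .updateExpr(Map(col("A.name") -> col("B.name")))      // update matches
      .whenNotMatched()
      .insertExpr(Map(col("A.id") -> col("B.id"),
                      col("A.name") -> col("B.name")))      // insert new rows
      .execute()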

Support Flink streaming write to carbon

CarbonData is now integrated with fault-tolerant streaming dataflow engines like Apache Flink: users can build a Flink streaming job and use a Flink sink to write data to carbon through the CarbonData SDK. The Flink sink generates stage files for the table; data from the stage files can then be inserted into the carbon table with the insert stage command, making it visible to queries.
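Once the Flink sink has produced stage files, making them queryable is a single command (table name hypothetical; syntax per the Flink integration guide):

    // Commit stage files written by the Flink sink into the carbon table.
    spark.sql("INSERT INTO flink_sink_table STAGE")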

Add segment

Many customers have already generated data in different formats such as ORC, Parquet, JSON, and CSV. If such users wanted to migrate to CarbonData for better performance or better features (such as the SDK), there was no mechanism: all the existing data had to be converted to CarbonData first.

To remove this limitation, the add segment command is introduced so that users can easily add segments of different formats to a carbon table and run queries across all of them.
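For example (path and names hypothetical; syntax per the add-segment documentation):

    // Register an existing Parquet directory as a segment of a carbon table.
    spark.sql("""
      ALTER TABLE sales ADD SEGMENT
      OPTIONS ('path'='hdfs://data/sales_2019_parquet', 'format'='parquet')
    """)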

Hive Write Support (Non-transactional)

CarbonData now supports write and read from the Hive execution engine. This is helpful for users who want to try carbon without migrating to Spark. Users can also convert their existing Parquet/ORC tables directly to the carbon format for ETL purposes.

Insert into performance improvement

Previously, the insert and load commands shared a common code flow, which imposed overheads on the insert command because features like bad-record handling are not required for insert.

Now the load and insert flows have been separated, and additional optimizations have been implemented for the insert command, such as:

1. Rearranging projections instead of rows.

2. Using Spark's internal row instead of the Row object.

These optimizations have been observed to give a 40% insert performance improvement on TPC-H data.

Optimize Bucket Table

The bucketing feature distributes/organizes table or partition data into multiple files such that similar records are present in the same file. Join operations on large datasets cause a large volume of data shuffling, making them quite slow; this shuffling can be avoided when joining on bucket columns. Bucket tables have been made consistent with Spark's bucketing, improving join performance by avoiding the shuffle on the bucket column.
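For example (bucket property names as in the DDL guide; some versions spell them BUCKETNUMBER/BUCKETCOLUMNS, so verify against your docs):

    // Rows sharing a user_id land in the same bucket file, so joins on
    // user_id between consistently bucketed tables can skip the shuffle.
    spark.sql("""
      CREATE TABLE events (user_id BIGINT, event STRING)
      STORED AS carbondata
      TBLPROPERTIES ('BUCKET_NUMBER'='8', 'BUCKET_COLUMNS'='user_id')
    """)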

PyCarbon Support

CarbonData now provides a Python API (PyCarbon) for integrating with AI frameworks such as TensorFlow, PyTorch, and MXNet. By using PyCarbon, AI frameworks can read training data faster by leveraging CarbonData's indexing and caching abilities. Since CarbonData is a columnar store, AI developers can also perform projection and filtering to efficiently pick only the data required for training.

Materialized Views on All Table Formats such as Parquet and ORC

CarbonData’s datamap interface can be used to improve the query performance of other formats like Parquet and ORC. One implementation of the datamap interface is the MV table, which pre-computes aggregation results based on user input. By creating an MV datamap on a Parquet/ORC table, users get the benefit of querying pre-computed data instead of raw data, which yields better query performance.

This is possible because carbon redirects the query to the MV datamap instead of the Parquet tables.
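A sketch on a hypothetical Parquet table (MV DDL per the materialized-view guide for this version):

    // Pre-computed aggregate over a Parquet table; matching aggregate queries
    // on sales_parquet are redirected to the MV automatically.
    spark.sql("""
      CREATE MATERIALIZED VIEW sales_by_country AS
      SELECT country, SUM(amount) AS total
      FROM sales_parquet
      GROUP BY country
    """)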

Behaviour Change

  1. The Field class has been moved from the sdk module to the core module; all integration code therefore has to update the package for Field.java accordingly.
  2. PreAggregate and Global Dictionary have been removed and can no longer be used from this version. Users need to re-create their tables in the old version without global dictionary/pre-aggregate and then upgrade to the new version.
  3. Lucene/Bloom/MV datamaps from old versions will not work in the new version because of a change in the "_system" folder location. Users can drop these datamaps and re-create them after the upgrade.
  4. CarbonSession has been deprecated in this version. It is advised to use CarbonExtensions instead.


Please find the detailed JIRA list here.
