Blog

Date: 2019-7-14

Author: Jacky Li, Ravindra Pesala

### Approaching CarbonData 1.6

When we started the CarbonData project in 2016, its goal was to provide a unified data solution for various data analytics scenarios. Three years have passed, and CarbonData has grown into a popular solution for a wide range of scenarios, including:

- Ad hoc analytics on detailed records
- Near real-time streaming data analytics
- Large-scale historical data with update capability
- Data marts with materialized view and SQL rewrite capability

In CarbonData 1.6.0, the following features further improve CarbonData's capability in the above scenarios:

- *Distributed index server*, which serves block pruning for hyper-scale data. In real-world use, we have a production system that serves 10 trillion records in a single table with second-level query response times.
- *Adjustable index key*: users can now change SORT_COLUMNS after creating a table, which lets them tune query performance as they learn more about what the business needs (see the sketch after this list).
- *Formal support for Spark, Presto, and Hive*, the three most popular compute engines in the big data world.
- *Production-ready materialized views and SQL rewrite*, a powerful feature to accelerate query performance in data mart scenarios.
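
As a rough illustration of the adjustable index key and the materialized view support, here is a minimal Spark SQL sketch. The table and column names are made up, and the exact DDL may differ slightly between CarbonData releases:

```scala
// Minimal sketch with invented table/column names; verify the DDL against your release.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// A Carbon-enabled session; the store path below is a placeholder.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("carbon-1.6-features")
  .getOrCreateCarbonSession("/tmp/carbon.store")

// Adjustable index key: re-sort future loads by the columns the business
// actually filters on, without recreating the table.
spark.sql("ALTER TABLE sales SET TBLPROPERTIES('SORT_COLUMNS'='country,device_id')")

// Materialized view (an 'mv' datamap in the 1.x line); matching aggregate
// queries can be rewritten to read from it automatically.
spark.sql(
  """CREATE DATAMAP sales_by_country USING 'mv' AS
    |SELECT country, sum(amount) FROM sales GROUP BY country""".stripMargin)
```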

Thanks to these unique features, many open-source and commercial CarbonData deployments have gone into production, covering use cases in Internet, telecom, finance and banking, smart city, and other domains.

### Towards CarbonData 2

On one hand, as Apache CarbonData approaches version 1.6.0, the technologies supporting the above scenarios have become mature and production ready. On the other hand, analytic workloads and runtime environments are even more diverse than three years ago: cloud environments require more complex data management for cloud storage, and AI requires data management for unstructured data. It is therefore high time to discuss the future of CarbonData.

First, let's examine what is changing in the data landscape, so that we can point CarbonData 2.x in a more reasonable direction.

In my humble opinion, there are three trends in the data world:

##### Trend 1: Data Lake and Data Warehouse are converging

More and more projects rely on a data lake, and more and more users find that a data lake lacks some basic features compared to a data warehouse, such as data reliability and performance, while a data warehouse is notoriously hard to scale. So, in order to meet business goals, data lakes and data warehouses are often used in a complementary manner.

##### Trend 2: Cloud makes data management more complex

With elastic and low-cost resources, the cloud is changing how enterprises store and analyze their data. More and more users adopt private and public cloud technologies; while they enjoy the benefits the cloud brings, they also have to manage more complex scenarios such as cloud bursting and data synchronization. At the end of the day, they may even find they have more data silos to manage.

##### Trend 3: AI raises new challenges for data management

Nowadays, AI is appearing in applications everywhere. According to industry studies, 80% of the effort before a model can be trained is spent on data preparation, and with the popularity of deep learning, unstructured data is becoming dominant in the AI domain. All these changes in data usage lead to new data management challenges, including data transformation, data tracking, version control, and more.

### The Goal of CarbonData 2

With these challenges in mind, the following goal of CarbonData 2 is proposed:

**Goal: Build a data fabric to manage large scale and diverse data reliably**.

By the term *data fabric* I mean

- It can be used as a Data Lake with high reliability and high performance
- It can be used as a Data Warehouse with the scalability and flexibility of compute-storage decoupling
- It is ready to support data management for hybrid cloud and AI applications

Thus, CarbonData becomes a unified data management solution for **Data Warehouse + Data Lake + AI**.

Eventually, it may look something like this:

![CarbonData 2](/Users/jacky/Documents/CarbonData2.jpg)


How can we achieve these goals? We must leverage the strengths of CarbonData 1.x and add new features that work towards them. So, we might ask ourselves what really makes CarbonData 1.x unique. I would summarize it as follows:

- Segment-based data organization, as a basis for large scale, faster loading, and transactional operation management
- Materialized and distributed metadata, as a basis for easy data migration and, again, large scale: since metadata is entirely treated as data, there is no single metadata-holding process limited by memory
- Multi-level indexing, as a basis for fast query performance while offering tunable loading performance to the user
- Main-delta based IUD (insert/update/delete) with ACID compliance, as a basis for SCD scenarios while keeping minimal I/O impact on immutable file systems

These are the most important features that make CarbonData unique among so many big data solutions. When going towards CarbonData 2.x, these features should be preserved and leveraged.


### CarbonData 2 Roadmap

Finally, we propose the following roadmap for CarbonData 2. This list is an initial draft of what we think CarbonData 2 should have, and the items will be implemented across multiple 2.x versions.

#### Segment Plugin Interface

- Refactor segment-related code to abstract a *Segment Plugin Interface* and make it format-neutral, so that plugins can be contributed by community developers (a hypothetical sketch follows this list)
- The following formats may be supported as built-in plugins in the initial iteration: carbon-row, carbon-columnar, csv; more plugins may be added by the community
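
To make this concrete, here is a purely hypothetical Scala sketch of what a format-neutral segment plugin could look like. None of these traits or names exist in the codebase today; they only illustrate the intended shape of the abstraction:

```scala
// Hypothetical sketch only: illustrative names, not part of CarbonData.
import org.apache.hadoop.conf.Configuration

/** One reader/writer pair per storage format (e.g. carbon-columnar, carbon-row, csv). */
trait SegmentPlugin {
  /** Format name used to select this plugin, e.g. "csv". */
  def format: String

  /** Write the rows of one segment and return the data files it produced. */
  def writeSegment(segmentPath: String, rows: Iterator[Array[Any]], conf: Configuration): Seq[String]

  /** Read one segment back, optionally pruning with a pushed-down filter expression. */
  def readSegment(segmentPath: String, filter: Option[String], conf: Configuration): Iterator[Array[Any]]
}

/** The table resolves the plugin for each segment by its recorded format name. */
object SegmentPluginRegistry {
  private val plugins = scala.collection.mutable.Map.empty[String, SegmentPlugin]
  def register(plugin: SegmentPlugin): Unit = plugins(plugin.format) = plugin
  def forFormat(format: String): Option[SegmentPlugin] = plugins.get(format)
}
```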


#### Transactional Segment Operation

- Support transactional operations on segments, making data management ACID compliant on HDFS and in the cloud. No more dirty data.
- Make the following operations formal in SQL and the DataFrame API (can be done later in 2.x); an illustrative sketch follows this list:
  - ALTER TABLE ADD SEGMENT
  - ALTER TABLE DROP SEGMENT
  - ALTER TABLE MOVE SEGMENT, with support for setting a segment moving policy
  - SEGMENT ITERATIVE QUERY (for pagination and sampling purposes)
  - ALTER TABLE COMPACTION
  - SHOW SEGMENTS
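
These statements are only proposed at this stage, so the grammar is not final. The sketch below merely shows how the operations might be expressed from a Carbon-enabled Spark session; table names, segment ids, and options are invented:

```scala
// Illustrative only: proposed syntax, subject to change before it lands in 2.x.
// `spark` is assumed to be a Carbon-enabled SparkSession.

// Register an externally written folder as a new, atomically visible segment.
spark.sql("ALTER TABLE sales ADD SEGMENT OPTIONS('path'='hdfs://nn/staging/2019-07-14','format'='carbon-columnar')")

// Drop a segment transactionally; readers never observe a half-deleted state.
spark.sql("ALTER TABLE sales DROP SEGMENT 42")

// Move a segment to cheaper storage according to a moving policy.
spark.sql("ALTER TABLE sales MOVE SEGMENT 42 TO 's3a://warm-bucket/sales'")

// Inspect segment status, size, and location.
spark.sql("SHOW SEGMENTS FOR TABLE sales").show(false)
```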

#### Cloud Ready

- Segment location awareness, supporting cloud storage, on-premises, and hybrid cloud deployments

- Segment replication, for caching, cloud bursting, cloud data synchronization, etc.


#### New features for Bad Records

- Specify data validation rules during load

- A new way to collect bad records during loading, plus easy-to-use tooling for exploring them (a rough example of today's load options follows)
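
For context, CarbonData 1.x already exposes bad-record handling as load options, and the items above build on that. A rough example is shown below; the paths are placeholders and the option names should be checked against the documentation for your release:

```scala
// Existing-style bad record options on LOAD DATA (placeholder paths; names per 1.x docs).
// REDIRECT writes offending rows to BAD_RECORD_PATH instead of failing the whole load;
// the 2.x proposal adds richer validation rules and tooling to explore these records.
spark.sql(
  """LOAD DATA INPATH 'hdfs://nn/input/sales.csv' INTO TABLE sales
    |OPTIONS(
    |  'BAD_RECORDS_LOGGER_ENABLE'='true',
    |  'BAD_RECORDS_ACTION'='REDIRECT',
    |  'BAD_RECORD_PATH'='hdfs://nn/badrecords/sales')""".stripMargin)
```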


#### New features for Update

- Support MERGE syntax to simplify SCD Type 2-style updates (illustrated below)
- Support timestamp- or version-based queries (time travel)
- Support update/delete on streaming tables
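
MERGE does not exist in CarbonData yet; the sketch below only illustrates the kind of SCD Type 2 update the proposal targets, written in the conventional MERGE INTO shape with invented table and column names. A full Type 2 flow would also insert the new row version for matched keys, typically in a second step:

```scala
// Illustrative only: proposed MERGE support, shown with invented names.
// Closes out the current version of changed customer rows and inserts brand-new customers.
spark.sql(
  """MERGE INTO customer_dim t
    |USING customer_updates s
    |ON t.customer_id = s.customer_id AND t.is_current = true
    |WHEN MATCHED AND t.address <> s.address THEN
    |  UPDATE SET t.is_current = false, t.end_date = s.change_date
    |WHEN NOT MATCHED THEN
    |  INSERT (customer_id, address, start_date, end_date, is_current)
    |  VALUES (s.customer_id, s.address, s.change_date, null, true)""".stripMargin)
```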


#### Integration

- Adapt to the Spark extension interface, removing all features that conflict with the extension mechanism
- Support Spark 2.4 integration
- Flink integration via the SDK to write and read CarbonData files (see the SDK sketch after this list)
- Support integration with more Hadoop distributions
- SDK support for transactional tables
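
The Flink and external-tool integrations lean on the existing file-level SDK. Below is a minimal write sketch in Scala calling the Java SDK; the builder methods follow the 1.x SDK guide but may differ slightly between releases, and the output path is a placeholder:

```scala
// Minimal SDK write sketch (builder methods per the 1.x SDK guide; verify against your release).
import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}
import org.apache.carbondata.core.metadata.datatype.DataTypes

val fields = Array(
  new Field("name", DataTypes.STRING),
  new Field("age", DataTypes.INT))

val writer = CarbonWriter.builder()
  .outputPath("/tmp/carbon-sdk-output")  // placeholder path
  .withCsvInput(new Schema(fields))      // rows are passed as arrays of strings
  .writtenBy("flink-demo")
  .build()

writer.write(Array("alice", "30"))
writer.write(Array("bob", "25"))
writer.close()  // flushes the .carbondata data files and the index files
```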


#### CarbonUI

- Segment management UI

- A backend server that can act as a central point to trigger data management operations

- Data connection management between cloud and on-premises environments


#### Misc

- Upgrade the default Java version to Java 1.8
- Compiled SQL templates for higher query performance on small tables
- Support multiple Spark versions in the Maven repo, e.g. carbon-2.3.2_2.11:1.6

