Apache CarbonData

Abstract

Apache CarbonData is a new Apache Hadoop native file format for faster interactive queries. It uses advanced columnar storage, indexing, compression, and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.

CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata

Background

Huawei is an ICT solution provider committed to enhancing customer experiences for telecom carriers, enterprises, and consumers in big data. To satisfy the following customer requirements, we created a new Hadoop native file format:

Based on these requirements, we investigated existing file formats in the Hadoop ecosystem, but could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.

Rationale

CarbonData contains multiple modules, which are classified into two categories:

  1. CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and APIs for reading and writing.

  2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned to abstract the execution runtime.

CarbonData File Format

The CarbonData file format is a columnar store in HDFS. It has many features of a modern columnar format, such as being splittable and supporting compression schema and complex data types. In addition, CarbonData has the following unique features:

Indexing

To support fast interactive queries, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but is contained in the CarbonData file itself. In the current implementation, CarbonData supports three types of indexing:

1. Multi-dimensional Key (B+ Tree index)

2. Inverted index

3. MinMax index
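
To illustrate how an index stored alongside the data can reduce I/O, the following is a minimal sketch of MinMax pruning. It is not CarbonData's implementation: the `Block` class, `build_blocks`, `scan_equal`, and the block size are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of column values plus its MinMax index entry (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks that actually had to be read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The MinMax entry proves the target cannot be in this block: skip it.
        if target < block.min_value or target > block.max_value:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


# Sorting the data first is an assumption of this sketch; it keeps block ranges
# narrow, which makes MinMax pruning more selective.
blocks = build_blocks(sorted([7, 3, 9, 1, 12, 15, 20, 18, 4]), block_size=3)
matches, blocks_read = scan_equal(blocks, 12)
print(matches, blocks_read)
```

In this sketch only one of the three blocks is read for the point lookup; the other two are ruled out by their min/max entries alone.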

Global Dictionary

Besides reducing I/O, CarbonData accelerates computation by using a global dictionary, which enables processing and query engines to perform all processing on encoded data without having to convert it (late materialization). We have observed dramatic performance improvements in OLAP analytic scenarios where a table contains many columns of string data type. The data is converted back to user-readable form just before the processing or query engine returns results to the user.
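
The idea of processing encoded data and decoding only at the end can be sketched as follows. This is an illustrative example under stated assumptions, not CarbonData's dictionary format: `build_dictionary`, `encode`, and the sample city values are hypothetical, and the dictionary here is simply a mapping from each distinct string to a small integer code.

```python
from collections import Counter


def build_dictionary(values):
    """Assign one integer code per distinct value (hypothetical encoding scheme)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each string with its integer code."""
    return [dictionary[v] for v in values]


cities = ["Shenzhen", "Bangalore", "Shenzhen", "Paris", "Bangalore", "Shenzhen"]
dictionary = build_dictionary(cities)
encoded = encode(cities, dictionary)

# The group-by count runs entirely on integer codes; no string is touched here.
counts_by_code = Counter(encoded)

# Late materialization: codes are turned back into strings only for the final result.
reverse = {code: value for value, code in dictionary.items()}
result = {reverse[code]: n for code, n in counts_by_code.items()}
print(result)
```

The aggregation compares and hashes small integers instead of strings, which is where the benefit for string-heavy tables would come from in this sketch.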

Column Group

Sometimes users want to process or query multiple columns in one table, for example when scanning for an individual record in a troubleshooting scenario. In this case, a row format is more efficient than a columnar format, since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of columns in row format, so that the data in a column group is stored together and can be retrieved quickly.
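
The difference between the two layouts can be sketched as below. This is only an in-memory illustration, not CarbonData's on-disk format: the column names, sample rows, and the `fetch_columnar` and `fetch_grouped` helpers are all hypothetical.

```python
# Sample records (hypothetical troubleshooting data).
names = ["session_id", "day", "event", "cell_id"]
rows = [
    (1001, "2016-05-01", "call_drop", 17),
    (1002, "2016-05-01", "handover", 3),
    (1003, "2016-05-02", "call_drop", 42),
]

# Pure columnar layout: one separate list per column.
columnar = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Layout with a column group: three columns are stored together, row by row.
group_columns = ["day", "event", "cell_id"]
group_idx = [names.index(c) for c in group_columns]
grouped = {
    "session_id": columnar["session_id"],
    "group": [tuple(row[i] for i in group_idx) for row in rows],
}


def fetch_columnar(store, pos):
    """Rebuild one record by visiting every column list (one access per column)."""
    return tuple(store[name][pos] for name in names)


def fetch_grouped(store, pos):
    """Rebuild one record with one access for the key and one for the whole group."""
    return (store["session_id"][pos],) + store["group"][pos]


print(fetch_columnar(columnar, 2))
print(fetch_grouped(grouped, 2))
```

Both helpers return the same record; the grouped layout gets there with two lookups instead of one per column, which mirrors the fewer scattered reads a column group is meant to provide.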

Optimized for multiple use cases

CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, and so can decide on and tune the format before loading data into CarbonData.

For example:

Use Case | Supporting Features
Interactive OLAP query | Columnar format, Multi-dimensional Key (B+ Tree index), MinMax index, Inverted index
High throughput scan | Global dictionary, MinMax index
Low latency point query | Multi-dimensional Key (B+ Tree index), Partitioning
Individual record query | Column group, Global dictionary

BigData Processing Framework Integration

Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala

Initial Goals

Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".

Current Status

CarbonData is production ready and already provides a large set of features. The current license is already Apache 2.0.

Meritocracy

We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.

Community

If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We believe that CarbonData will become a key project for big data columnar platforms, and so we are counting on a large community of users and developers.

Known Risks

Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors, balanced with the need for extreme stability and core implementation coherency.

Orphaned products

Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.

Inexperience with Open Source

Huawei has been developing and using open source software for a long time. Additionally, several ASF veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.

Reliance on Salaried Developers

Most of the contributors are paid to work in the big data space. While they might wander from their current employers, they are unlikely to venture far from their core expertise, and thus will continue to be engaged with the project regardless of their current employers.

An Excessive Fascination with the Apache Brand

While we intend to leverage the Apache ‘branding’ when talking to other projects as a testament to our project’s ‘neutrality’, we have no plans to use the Apache brand in press releases or to post billboards advertising the acceptance of CarbonData into the Apache Incubator.

Initial Source

https://github.com/HuaweiBigData/carbondata.git

External Dependencies

All external dependencies are licensed under the Apache 2.0 license or an Apache-compatible license. As we grow the CarbonData community, we will configure our build process to require and validate that all contributions and dependencies are licensed under the Apache 2.0 license or an Apache-compatible license.

Required Resources

Mailing lists

Git Repository

Issue Tracking

Initial Committers

Affiliations

Sponsors

Champion

Mentors

Sponsoring Entity

The Apache Incubator

CarbonDataProposal (last edited 2016-05-25 20:23:52 by jbonofre)