Blog

Change data capture (CDC) is a process that captures changes made in a database and replicates those changes to a destination such as a data warehouse or a data lake. More generally, the change data can come either from database operations or from custom changes applied by users.
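The core idea can be sketched with a toy change log replayed against an in-memory table. The event shape (`op`, `key`, `value`) is a simplified assumption for illustration, not any particular database's change-log format or a CarbonData API:

```python
# Minimal sketch of applying CDC events to a target table.
# The event format (op/key/value) is a hypothetical simplification.

def apply_cdc(target: dict, events: list) -> dict:
    """Replay a change log (insert/update/delete) onto a key->row dict."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            target[key] = event["value"]   # upsert the new row image
        elif op == "delete":
            target.pop(key, None)          # drop the row if present
    return target

# Source-side changes captured as a change log...
events = [
    {"op": "insert", "key": 1, "value": {"name": "a", "qty": 10}},
    {"op": "update", "key": 1, "value": {"name": "a", "qty": 12}},
    {"op": "insert", "key": 2, "value": {"name": "b", "qty": 5}},
    {"op": "delete", "key": 2},
]

# ...replayed against the destination (e.g. a data-lake table).
table = apply_cdc({}, events)
print(table)  # {1: {'name': 'a', 'qty': 12}}
```

Replaying the log in order is what keeps the destination consistent with the source: a later update or delete must see the effect of the earlier insert.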


Read the complete blog here.

Spark is no doubt a powerful processing engine and a distributed cluster computing framework for faster processing. Unfortunately, there are a few areas where Spark has drawbacks, and combining Apache Spark with Apache CarbonData can overcome them. A few of those drawbacks are listed below:

  1. No support for ACID transactions
  2. No data quality enforcement
  3. The small-files problem
  4. Inefficient data skipping


Read the complete blog here.

We have seen a lot of interest in an efficient, reliable solution that brings mutation and transaction capability to data lakes. In a data lake, it is very common for users to generate reports from a single set of data, but as various types of data flow in, the state of that data cannot remain immutable. Use cases that require mutating data include data that changes over time, late-arriving data, balancing real-time availability with backfilling, state-changing data such as CDC, data snapshotting, and data cleansing. While generating reports, these all result in writes or updates to the same set of tables.
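The mutation pattern behind most of these use cases is a MERGE-style upsert: rows from an incoming source are matched against the target table by key, updating matched rows and inserting the rest. The sketch below is a pure-Python illustration of those semantics, not CarbonData's merge API:

```python
# Sketch of MERGE-style upsert semantics used when mutating data-lake
# tables (late-arriving data, CDC, backfilling). Pure-Python illustration;
# the function and table shapes are hypothetical.

def merge_upsert(target: list, source: list, key: str) -> list:
    """WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # matched: update in place
        else:
            by_key[row[key]] = dict(row)   # not matched: insert
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]
late_arriving = [{"id": 2, "amount": 250}, {"id": 3, "amount": 300}]

merged = merge_upsert(target, late_arriving, "id")
print(merged)
# [{'id': 1, 'amount': 100}, {'id': 2, 'amount': 250}, {'id': 3, 'amount': 300}]
```

The transactional part, which a format like CarbonData provides and this sketch does not, is making the whole upsert visible atomically so concurrent report queries never see a half-applied merge.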


Read the complete blog here.

A materialized view is a pre-computed data set and one of the most important query performance tuning tools in big-data systems, allowing users to pre-join complex views and pre-compute summaries for quick response times. In CarbonData, materialized views help improve performance by pre-computing relevant query projections, filters, and expensive operations such as aggregations and joins. With materialized views on a carbon table, we can avoid unnecessary full scans of big tables and make queries faster.
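The trade-off can be shown in miniature: an aggregate is computed once at load time, and a matching query is answered from the small summary instead of rescanning the big fact table. The table and column names below are illustrative, not CarbonData DDL:

```python
# Sketch of the materialized-view trade-off: storage for query speed.
# The `sales` table and `region` column are hypothetical examples.
from collections import defaultdict

sales = [  # the big "fact" table
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 5},
]

# Conceptually: CREATE MATERIALIZED VIEW mv AS
#   SELECT region, SUM(amount) FROM sales GROUP BY region
def build_mv(rows):
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

mv = build_mv(sales)  # pre-computed once, when data is loaded

# A query like SELECT SUM(amount) FROM sales WHERE region = 'east'
# can be rewritten to read the tiny summary instead of scanning `sales`.
print(mv["east"])  # 30
```

The query rewrite is the key step: the user still queries the base table, and the engine transparently substitutes the pre-computed view when the query's projections and aggregates are covered by it.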


Read the complete blog here.

CarbonData uses caching to increase query performance: it caches block/blocklet index information and prunes files using that cache. With caching, the number of files to be read is reduced, which cuts I/O time and improves overall query performance.
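Pruning with a cached index works roughly like this: each block keeps min/max statistics for a column, and a filter query only opens the files whose range can contain the filter value. The metadata layout below is a simplified assumption, not CarbonData's on-disk index format:

```python
# Sketch of min/max index pruning. Block metadata layout is hypothetical.

blocks = [  # per-file index entries, held in an in-memory cache
    {"file": "part-0", "min": 1,   "max": 100},
    {"file": "part-1", "min": 101, "max": 200},
    {"file": "part-2", "min": 201, "max": 300},
]

def prune(blocks, value):
    """Return only the files whose [min, max] range may hold `value`."""
    return [b["file"] for b in blocks if b["min"] <= value <= b["max"]]

# With the index cached, a point filter touches one file instead of
# all three, reducing I/O proportionally.
print(prune(blocks, 150))  # ['part-1']
```

Because the index lives in a cache rather than being re-read from disk, the pruning decision itself is cheap, so the saving is almost entirely the skipped data files.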


Read the complete blog here.