Status

Current state: Under Discussion

Discussion thread: here

JIRA

Released: Unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).



Motivation

Cassandra's query optimization logic is currently based on hard-coded heuristics that rely on assumptions instead of on the resource cost of queries. One of those assumptions is that filtering should always be avoided if possible. Unfortunately, depending on the data distribution, that assumption can be wrong and lead to a less efficient execution path.

Moreover, the current optimization logic is difficult to understand and follow as it is spread across the code base, making the addition of more advanced features like joins or subqueries much harder than it should be.

Cassandra also does not provide insight into how it plans to execute a query, or statistics on the query execution. Both are important for query optimization and troubleshooting.

Prerequisites

  • Simplify the CQL layer to make it more robust to changes
  • Address some of the inconsistencies of the CQL language to make it more user-friendly

Goals

  • Introduce a Cascades (2) query optimizer with easily extendable rules
  • Improve query performance for most common queries
  • Add support for EXPLAIN and EXPLAIN ANALYZE to help with query optimization and troubleshooting
  • Lay the groundwork for the addition of features like joins, subqueries, OR/NOT and index ordering
  • Put in place some performance benchmarks to validate query optimizations

Figure 1 below shows at a high level the new architecture proposed by this CEP for the CQL layer. 

Fig 1: New architecture proposed by this CEP

Non-goals

  • Trying to optimize any possible query. The expectation is that work on query optimizations will continue after this CEP is completed.
  • Providing support for new functionalities like joins and subqueries. One of the goals of this CEP is to provide the foundation for those features at the optimizer level, but not to implement them.

Proposed Changes

Removing unnecessary CQL query limitations and inconsistencies

CQL has many query limitations and inconsistencies. At the code level, those inconsistencies make the code harder to change than it should be, whereas at the CQL level they result in confusing syntax and behavior.

Some examples are:

  1. Some operators can be used in specific contexts but not in others. For example, != is supported in LWT conditions but not in WHERE clauses.
  2. Differences in behavior across operator types: some operators, like CONTAINS and CONTAINS KEY, can be combined, while others cannot.
  3. Inconsistencies in bind marker support: bind markers are supported in tuples and UDTs but not in collections.
  4. Aliases can be defined, but they cannot be used in other clauses (e.g., GROUP BY).

An important prerequisite for the optimizer integration is to remove those limitations and inconsistencies in order to facilitate the migration to a relational-algebra representation.

Moving CQL layers structures to relational algebra

Relational algebra offers a standardized way to represent queries that helps eliminate ambiguities inherent in a natural-language-based query language like CQL.

It allows the optimizer to explore various logical transformations of a query. For example, with the future introduction of joins, the optimizer can strategically apply selection and projection operations at earlier stages, effectively minimizing the volume of data processed. Similarly, when handling disjunctions (OR conditions), it will have the capability to decompose these into a union of multiple select operations, streamlining the query execution process.
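Both transformations correspond to standard relational algebra identities. For filter predicates p and q over a relation R (and, once joins exist, a join R ⋈ S where p only references columns of R):

    σ_p∨q(R) ≡ σ_p(R) ∪ σ_q(R)    (a disjunction becomes a union of selects)
    σ_p(R ⋈ S) ≡ σ_p(R) ⋈ S       (the selection is pushed below the join)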

With transformations expressed in relational algebra, the optimizer can more easily estimate the costs of different query plans and select the most efficient one.

In the context of this CEP only the following operators will be implemented:

  • Fundamental operators:
    • Select (σ)
    • Projection (π)
    • Union (∪)
  • Extended operators:
    • Rename (ρ)
    • Duplicate Elimination (δ)
    • Aggregation (γ)
    • Sorting (τ)
    • Grouping (𝛾)
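As an illustration, consider a simple query against a hypothetical users table (ignoring current CQL restrictions, some of which this CEP aims to remove):

    SELECT DISTINCT name FROM users WHERE age > 21 ORDER BY name;

In relational algebra this becomes τ_name(δ(π_name(σ_age>21(users)))): select the matching rows (σ), project the name column (π), eliminate duplicates (δ), and sort the result (τ).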

Introducing query simplification/normalization

To facilitate the optimizer work, there is a need to transform different but equivalent queries into a consistent form. The goal is also to reduce queries to their simplest form, removing redundant or unnecessary elements, to make the subsequent optimization steps more efficient.

Normalization ensures that equivalent queries behave similarly, regardless of how they are written, leading to more predictable system behavior.

To allow for easy extensibility, the component in charge of query simplification and normalization should be rule-based, providing an easy extension point for further simplification and normalization steps.

Some examples of simplification rules are:

  • Constant folding: 5 + 3 ⇒ 8 
  • Eliminating unnecessary operations: -(-5) ⇒ 5 or removing ORDER BY clusteringColumn ASC
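As a minimal sketch, assuming a hypothetical expression AST and rule interface (illustrative Java using record patterns; none of these names are actual Cassandra classes), a constant-folding rule could look like this:

    // Hypothetical AST types, for illustration only.
    interface Expression {}
    record Constant(int value) implements Expression {}
    record Add(Expression left, Expression right) implements Expression {}

    // A simplification rule rewrites an expression into an equivalent, simpler one.
    interface SimplificationRule
    {
        // Returns the rewritten expression, or the input unchanged if the rule does not apply.
        Expression apply(Expression expression);
    }

    // Constant folding: Add(Constant(5), Constant(3)) => Constant(8).
    class ConstantFolding implements SimplificationRule
    {
        public Expression apply(Expression expression)
        {
            if (expression instanceof Add(Constant left, Constant right))
                return new Constant(left.value() + right.value());
            return expression;
        }
    }

The normalizer would apply all registered rules repeatedly until none of them changes the query (a fixed point), which is what makes new rules cheap to add.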

The introduction of disjunctions, negations, joins, and subqueries in future work will come with the need for new simplification and normalization rules that could easily be added to this framework (e.g., predicate push-down, column pruning, …).

Bringing normal query and secondary index optimization logic together

Currently, the decision to use a secondary index is made separately from how the index decides to perform queries. To compare the costs of the different access methods, the optimizer needs access to the secondary index's own rules regarding query execution.

This change will require some redesign of the Secondary Index API.
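One hypothetical shape for that redesign, purely for illustration (none of these types exist in the current Secondary Index API):

    // Illustrative only: a cost-aware extension point for secondary indexes.
    interface Predicate {}
    record Cost(double value) {}

    interface IndexCostEstimator
    {
        // Estimated number of rows the index would return for the given predicate.
        long estimateMatchingRows(Predicate predicate);

        // Estimated cost of executing the given predicate through this index, so the
        // optimizer can compare an index scan with other access methods.
        Cost estimateCost(Predicate predicate);
    }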

Create an Optimizer and Cost Model API

Allowing for the pluggability of different optimizer or cost model implementations should facilitate experimentation and allow people to run Cassandra with implementations specific to their needs.
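A minimal sketch of what those pluggability points could look like (interface names and signatures are illustrative assumptions, not the proposed API):

    // Illustrative only: pluggable optimizer and cost model entry points.
    interface PlanNode {}

    interface CostModel
    {
        // Estimated total cost of a candidate execution plan.
        double cost(PlanNode plan);
    }

    interface QueryOptimizer
    {
        // Returns the cheapest equivalent plan found for the given logical plan.
        PlanNode optimize(PlanNode logicalPlan, CostModel costModel);
    }

An operator could then select a different QueryOptimizer or CostModel implementation through configuration, without touching the rest of the CQL layer.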

One hope is that, by facilitating experimentation, Cassandra will in the long run benefit from the community's findings and that those findings will be implemented in the default implementation.

Introduce support for EXPLAIN and EXPLAIN ANALYZE

For Cassandra developers and users, providing transparency on the internal decisions made by the query optimizer is critical.

EXPLAIN queries will expose the execution plan and the cost estimates without executing the query, while EXPLAIN ANALYZE queries will execute the query and expose the same information along with the actual execution statistics. Together, those queries can help identify bottlenecks or discover bugs within the query optimizer logic. They can also help validate that performance tuning and optimizations are having the intended effect. As such, they provide a common ground for discussing performance issues and optimizations.

Implement a cost optimizer

The three main components of a cost-based optimizer are the cardinality estimator, the cost model, and the plan enumeration framework.

For computing cost, the most critical component is the cardinality estimator, as it can introduce errors of several orders of magnitude (4, 5).

Plan Enumeration Framework (Cascades)

The Cascades framework is a general optimization strategy based on the principles of transformational optimization (1, 2).  

By systematically applying transformation rules, it can explore a wide range of potential execution plans, uncovering efficient strategies that simpler optimizers might miss. Its cost-based nature ensures that it selects the most efficient execution plan based on actual data distribution and usage patterns, rather than relying on static rules. Additionally, the framework's structured approach to query transformation and evaluation leads to more predictable and consistent query performance.

The Cascades framework's modular design also makes it highly extensible, allowing for the easy integration of new optimization techniques and adaptability to evolving query complexity.
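As a deliberately simplified sketch of the transformational approach (a real Cascades implementation memoizes groups of equivalent sub-plans and prunes the search using costs; all names here are illustrative):

    // Illustrative only: enumerate equivalent plans via transformation rules, keep the cheapest.
    import java.util.*;
    import java.util.function.ToDoubleFunction;

    interface PlanNode {}

    interface TransformationRule
    {
        // Returns equivalent alternatives for the given plan (empty if the rule does not apply).
        List<PlanNode> apply(PlanNode plan);
    }

    class SimpleTransformationalOptimizer
    {
        private final List<TransformationRule> rules;

        SimpleTransformationalOptimizer(List<TransformationRule> rules) { this.rules = rules; }

        // Plans are assumed to implement equals/hashCode so alternatives are not re-explored.
        PlanNode optimize(PlanNode initial, ToDoubleFunction<PlanNode> costModel)
        {
            Set<PlanNode> explored = new HashSet<>();
            Deque<PlanNode> toExplore = new ArrayDeque<>(List.of(initial));
            PlanNode best = initial;

            while (!toExplore.isEmpty())
            {
                PlanNode plan = toExplore.poll();
                if (!explored.add(plan))
                    continue; // already seen this alternative
                if (costModel.applyAsDouble(plan) < costModel.applyAsDouble(best))
                    best = plan;
                for (TransformationRule rule : rules)
                    toExplore.addAll(rule.apply(plan)); // enqueue equivalent alternatives
            }
            return best;
        }
    }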

Cost Model

Databases usually compute costs based on a mix of physical costs (CPU cycles, I/O, ...), logical costs (number of tuples per operator) and algorithmic costs.

In a distributed database like Cassandra, physical costs like CPU and disk I/O will differ per node, forcing the coordinator either to fetch that information or to use its own physical costs as a reference. Due to that complexity, the initial implementation of the cost model will only take networking into account as a physical cost. Other physical costs should be accounted for as part of the algorithmic cost.

Because some Cassandra components are pluggable (memtables, SSTables, partitioners, …), the algorithmic costs will change depending on which components are used. To make those costs accessible to the cost model, the APIs of those components will be modified to expose their per-row and fixed algorithmic costs.

Logical costs will be computed based on the cardinalities returned by the cardinality estimator.
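Putting the three cost sources together, one plausible shape for the cost of a single plan operator (an illustration of the approach, not a committed formula) is:

    cost(operator) = network_cost + fixed_cost + per_row_cost × estimated_rows

where network_cost is the physical cost seen by the coordinator, fixed_cost and per_row_cost are the algorithmic costs exposed by the pluggable components involved, and estimated_rows comes from the cardinality estimator described below.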

Cardinality Estimator

To compute logical cost, the optimizer will need to estimate the number of rows that will be processed (cardinality).

The traditional way to do those estimates, and the one proposed by this CEP, is to use table, index, and materialized view statistics.

In order to ensure that the execution plans on each node are the same, the cardinality estimator should provide the same global statistics on every node, as well as a notification mechanism that can be used to trigger re-optimization.

As part of this CEP some new statistics will be added to the Memtables, SSTables and secondary indexes:

  • histograms (e.g., single-column cardinality)
  • sketches (e.g., distinct cardinality estimation)
  • tries (e.g., text columns)
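As an example of how such statistics feed the estimator, a minimal equi-width histogram could estimate the selectivity of a range predicate as follows (illustrative Java, not the proposed implementation):

    // Illustrative only: an equi-width histogram estimating the fraction of rows
    // whose value exceeds a threshold, assuming uniformity within each bucket.
    class EquiWidthHistogram
    {
        private final long[] bucketCounts; // number of rows per bucket
        private final double min, max;     // value range covered by the histogram

        EquiWidthHistogram(long[] bucketCounts, double min, double max)
        {
            this.bucketCounts = bucketCounts;
            this.min = min;
            this.max = max;
        }

        double estimateSelectivityGreaterThan(double threshold)
        {
            double bucketWidth = (max - min) / bucketCounts.length;
            double total = 0, matching = 0;
            for (int i = 0; i < bucketCounts.length; i++)
            {
                double bucketStart = min + i * bucketWidth;
                double bucketEnd = bucketStart + bucketWidth;
                total += bucketCounts[i];
                if (threshold <= bucketStart)
                    matching += bucketCounts[i]; // the whole bucket is above the threshold
                else if (threshold < bucketEnd)
                    matching += bucketCounts[i] * (bucketEnd - threshold) / bucketWidth; // partial overlap
            }
            return total == 0 ? 0 : matching / total;
        }
    }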

Statistics are critical for getting correct estimates. However, collecting and maintaining these statistics also incurs overhead, so there needs to be a balance between the level of detail in the statistics and their performance impact.

Common problems

Two common problems of cardinality estimation are bind variables and column correlations (5).

Correlated columns

For column correlations, SQL Server, Oracle, and DB2 each have their own approach and syntax to allow administrators to inject new statistics for correlated columns (3, 8, 9). CockroachDB seems to use the columns in secondary indexes as a hint to find correlated columns (10).
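The underlying issue is that, with only per-column statistics, estimators combine predicates under an independence assumption:

    sel(p ∧ q) ≈ sel(p) × sel(q)

For correlated columns, such as city = 'Paris' and country = 'France', the true combined selectivity is close to sel(p) alone, so the product can underestimate the cardinality by orders of magnitude.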

For Cassandra, a solution would be to allow users to specify which columns are correlated as part of the schema. That information would result in the computation of additional statistics that take the correlation into account.

Bind variables

When trying to optimize a query at preparation time, a common problem faced by optimizers is that the values of bind variables are unknown, preventing an efficient use of statistics.

To solve this problem, SQL Server uses parameter sniffing and Oracle uses bind variable peeking (7), where the execution plan is optimized based on the parameter values provided during the first executions of the query. In the case of Cassandra, this approach could result in different choices of execution plans on different coordinators, making the initial evaluation of the optimizer harder. Therefore, as part of this CEP, the query optimizer will use statistical data to determine a representative value for a bind variable during query optimization. Parameter sniffing can be investigated in some follow-up work.

Hints

In some specific cases, users may want to influence the optimizer's decisions. Therefore, support for hints instructing the optimizer to use or ignore specific indexes will be added.

New or Changed Public Interfaces

The CQL layer refactoring and the switch to a cost-based optimizer will have some impact at the CQL level as well as at the API level.

Changes at the CQL level

At the CQL level, the main changes will be removing some limitations and extending the contexts in which some functionalities can be used.

Support for EXPLAIN and EXPLAIN ANALYZE will be added as well as support for hints.

Changes at the API level

  • The secondary index API will be changed to provide access to the logical costs as well as the index statistics.
  • The APIs of the pluggable components involved in query processing will be changed to provide access to the logical costs.
  • New statistics will be added at the Memtable and SSTable level.
  • New Optimizer and Cost Model APIs will be developed to allow plugging in a different optimizer or cost model.

API changes related to cost will be breaking changes, as providing default implementations for them could result in the changes going unnoticed, leading to performance issues in production.

Timeline

The work will be divided into two phases.

The first phase will be the refactoring of the CQL layer together with the introduction of the new optimizer and cost model APIs. This phase will be done incrementally on trunk, such that trunk is always in a releasable state. The performance benchmarks will be developed as part of this phase.

The Cascades optimizer and the cost model will be developed in the second phase.

In practice, there will be some overlap between the 2 phases.

Compatibility, Deprecation, and Migration Plan

After the first phase, the Cassandra optimizer should behave the same way it currently does. The second phase will introduce the new cost-based optimizer, allowing users to switch from the old optimizer to the new one through configuration. Once there is enough confidence in the new optimizer, the old one will be marked as deprecated and scheduled for removal.

Test Plan

  • Changes at the CQL level should be tested by adding new unit tests and fuzz testing using Harry.
  • The new optimizer and cost model should be tested through unit tests and JMH benchmarks.
  • Overall performance should be validated using the performance benchmarks developed during the first phase.

Rejected Alternatives

Apache Cassandra has been using a heuristic approach for a long time, and an option would be to continue with this approach.

In practice, most databases start with heuristic optimizations and evolve toward cost-based optimization when they start supporting more complex queries. For example, supporting queries with joins in an efficient way requires looking at the data distribution and using a cost-based approach.

Cassandra is also becoming more mature with each release, and the number of possible execution paths will grow as SAI evolves and other improvements like global indexes appear.

Considering that the current optimization code needs some heavy cleanup, and given the plans for supporting more advanced queries, switching to a cost-based optimizer appears to be a better option than continuing with a heuristic one.

References

  1. Bruno, N., & Nehme, R. V. (2008). Configuration-Parametric Query Optimization for Physical Design Tuning. Proceedings of the ACM International Conference on Management of Data (SIGMOD).
  2. Graefe, G. (1995). The Cascades Framework for Query Optimization. IEEE Data Engineering Bulletin.
  3. IBM Documentation. (Date not provided). Avoiding problems with correlated columns in DB2 for z/OS.
  4. Leis, V., et al. (2015). How Good are Query Optimizers, Really? In VLDB.
  5. Lohman, G. (2014). Is Query Optimization a “Solved” Problem? ACM SIGMOD Blog.
  6. Momjian, B. (2022). Explaining the Postgres Query Optimizer. Citus Con: An Event for Postgres.
  7. Nanda, A. (Date not provided). Oracle Database 11g: Adaptive Cursors and SQL Plan Management.
  8. Shamsudeen, R. (Date not provided). Multi-Column Correlation and Extended Stats in Oracle 11g.
  9. SQL Server 2020 documentation. (2020). Query Predicate contains multiple correlated columns.
  10. Taft, R. (2020). CockroachDB's Query Optimizer. CMU Database Group - Quarantine Tech Talks.