Status

Current state: Draft

Discussion thread:

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Anti-entropy (Apache Cassandra repair) is essential for every Apache Cassandra cluster to fix data inconsistencies. Frequent data deletions and downed nodes are common causes of data inconsistency. A few open-source orchestration solutions that trigger repair externally are available, and many companies have built their own repair solutions. However, this multitude of custom solutions has led to a lot of confusion. Therefore, Apache Cassandra should officially support repair orchestration, much like Compaction, to be a complete solution.

Audience

This enhancement proposal is aimed at new adopters of Apache Cassandra, who often must make significant investments before they can use it.

Goals

  1. The proposal is to align on one of the existing solutions and have it officially blessed and supported as a first-class feature by the Apache Cassandra community.
  2. The solution has to be extremely easy for an operator to manage, so that even a novice user can manage it.
  3. The solution should scale to a large fleet without much additional operational overhead. In other words, the operational complexity should not increase linearly with the Cassandra fleet size.

...

  • Folks who already have their own solution should be able to continue using it. Migrating from a custom-built solution is a non-trivial effort. In the long run, however, we want folks to converge on the officially supported solution.
  • Newcomers should be advised to use the officially recommended solution.

Non-Goals

  1. Automated repair inside Cassandra itself, like compaction.

Proposed Changes

A few ready-made solutions already exist and are being used in the industry at scale in private forks. So, the first and foremost step is to reach a consensus among the available solutions. Once a solution is chosen, it should be adopted as the official Apache Cassandra solution.

Solution#1 Scheduled Repair in Cassandra by Joey Lynch

There have been many attempts to automate repair in Cassandra, which makes sense given that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.

...

Ticket: [CASSANDRA-14346] Scheduled Repair in Cassandra - ASF JIRA (apache.org)

Solution#2 Automatic repair scheduling by Marcus Olsson

Scheduling and running repairs in a Cassandra cluster is most often a required task, but it can be hard for new users and it requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? The idea is to automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.

Ticket: [CASSANDRA-10070] Automatic repair scheduling - ASF JIRA (apache.org)

Solution#3 Automated Repair in Cassandra by Jaydeepkumar Chovatia

Anti-entropy (Apache Cassandra repair) is essential for every Apache Cassandra cluster to fix data inconsistencies. Frequent data deletions and downed nodes are common causes of data inconsistency. A few open-source orchestration solutions that trigger repair externally are available, and many companies have built their own repair solutions. However, this multitude of custom solutions has led to a lot of confusion. Therefore, repair should be an integral part of Cassandra itself, very much like Compaction, for Cassandra to be a complete solution.

...

Discussion over Slack: https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619

New or Changed Public Interfaces

TODO

Compatibility, Deprecation, and Migration Plan

TODO

Test Plan

  • Operationally easy to manage.
  • Correctness of the repair solution.

Rejected Alternatives

TODO