Status

Current state: Draft

Discussion thread:

JIRA

Released: <Cassandra Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Anti-entropy (Apache Cassandra repairs) is important for every Apache Cassandra cluster because it fixes data inconsistencies. Frequent data deletions and downed nodes are common causes of such inconsistencies. A few open-source orchestration solutions that trigger repair externally are available, and many companies have built their own repair solutions. However, this proliferation of custom solutions has led to a lot of confusion. Therefore, Apache Cassandra should officially support repair orchestration, much like compaction, so that it can be called a complete solution.

Audience

This enhancement proposal targets users who are newly adopting Apache Cassandra and who often must make significant investments before they can use it.

Goals

  1. The proposal is to align on one of the existing solutions and have it officially blessed and supported as a first-class feature by the Apache Cassandra community.
  2. The solution has to be extremely easy for an operator to manage, so that even a novice user can run it.
  3. The solution should scale to a large fleet without much additional operational overhead. In other words, the operational complexity should not increase linearly with the Cassandra fleet size.

Repair has been a widely discussed topic for a long time, and many people have developed their own repair solutions, so the primary motivation is not to force everybody to migrate to an officially recommended solution. The primary goal is to agree on an official solution so that we, as the Cassandra community, lower the barrier to entry for newcomers. Whatever solution we come up with should therefore have the following two properties:

  • Folks who already have their own solution should be able to continue using it. Migrating from a custom-built solution is a non-trivial effort. In the long run, however, we want folks to converge on the officially supported solution.
  • Newcomers should be advised to use the officially recommended solution.

Non-Goals

  1. Automated repair inside Cassandra itself, like compaction.

Proposed Changes

A few ready-made solutions are already available and are being used at scale in the industry in private forks. The first step, therefore, is to reach consensus on one of the available solutions. Once a solution is chosen, it will be adopted as the official Apache Cassandra solution.

Solution#1 Scheduled Repair in Cassandra by Joey Lynch

There have been many attempts to automate repair in Cassandra, which makes sense given that repair is necessary to give our users eventual consistency. Most recently, CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.

At Netflix we've built a scheduled repair service within Priam (our sidecar), which we spoke about last year at NGCC. Given the positive feedback at NGCC, we focussed on getting it production ready and have now been using it in production to repair hundreds of clusters, tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback at NGCC, we have invested effort in figuring out how to integrate this natively into Cassandra rather than open sourcing it as an external service (e.g. in Priam).

As such, vinaykumarcse and I would like to re-work and merge our implementation into Cassandra, and have created a design document showing how we plan to make it happen, including the user interface.

As we work on the code migration from Priam to Cassandra, any feedback would be greatly appreciated about the interface or v1 implementation features. I have tried to call out in the document features which we explicitly consider future work (as well as a path forward to implement them in the future) because I would very much like to get this done before the 4.0 merge window closes, and to do that I think aggressively pruning scope is going to be a necessity.

Ticket: [CASSANDRA-14346] Scheduled Repair in Cassandra - ASF JIRA (apache.org)

Solution#2 Automatic repair scheduling by Marcus Olsson

Scheduling and running repairs in a Cassandra cluster is most often a required task, but it can be hard for new users and it requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? The idea is to automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.

Ticket: [CASSANDRA-10070] Automatic repair scheduling - ASF JIRA (apache.org)

Solution#3 Automated Repair in Cassandra by Jaydeepkumar Chovatia

Anti-entropy (Apache Cassandra repairs) is important for every Apache Cassandra cluster because it fixes data inconsistencies. Frequent data deletions and downed nodes are common causes of such inconsistencies. A few open-source orchestration solutions that trigger repair externally are available, and many companies have built their own repair solutions. However, this proliferation of custom solutions has led to a lot of confusion. Therefore, repair should be an integral part of Cassandra itself, very much like compaction, for Cassandra to be called a complete solution.

Design doc: Automated Repair in Cassandra - Google Docs

PR: [Private Fork - Cassandra 5.0]

Discussion over Slack: https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619

New or Changed Public Interfaces

TODO

Compatibility, Deprecation, and Migration Plan

TODO

Test Plan

  • Operationally easy to manage.
  • Correctness of the repair solution.

Rejected Alternatives

TODO

