Cluster Membership: ZOOKEEPER-107
ZooKeeper clusters are currently static: the set of servers participating in a cluster are statically defined in a configuration file. In many instances, however, it is desirable to have the ability of adding and removing servers from an ensemble. The difficulty of implementing such a feature is making sure that a change in the configuration does not cause inconsistencies in a ZooKeeper cluster. A related issue is the one of enabling a client to learn the current ensemble of servers dynamically.
- The ensemble of servers must agree upon the current configuration, and to reach agreement, Zab sounds like the obvious choice;
- We need new client calls to add and remove servers. It is unclear whether we want one call for each modification or one call to propose a whole new configuration;
- It must work with both majority and flexible quorums;
- We need a mechanism, perhaps based on URIs, to enable a client to learn the current configuration.
Some pre-design random thoughts
When moving from one configuration to another, we need to make sure that a quorum of the old configuration and a quorum of the new configuration commit to the new configuration. A quorum of the old configuration needs to agree to avoid a split-brain problem, for example, when adding more servers. A quorum of the new configuration needs to agree for progress. We also need to make sure that a quorum of the old configuration confirms first, otherwise a partition could cause a split-brain problem.
If the current leader is part of the old and new configurations, then it can keep being the leader once the new leader is installed. Otherwise an epoch change becomes necessary.
It is critical to make sure that every operation committed once a configuration is installed is acknowledged by a quorum from the new configuration. Otherwise a leader crash can cause committed operations to be lost. It might be simpler to stall the pipeline of request processors when a reconfiguration goes through PrepRequestProcessor. By stalling I mean holding operations until the reconfiguration operation is committed.