ZooKeeper clusters are currently static: the set of servers participating in a cluster are statically defined in a configuration file. In many instances, however, it is desirable to have the ability of adding and removing servers from an ensemble. The difficulty of implementing such a feature is making sure that a change in the configuration does not cause inconsistencies in a ZooKeeper cluster. A related issue is the one of enabling a client to learn the current ensemble of servers dynamically.
When moving from one configuration to another, we need to make sure that a quorum of the old configuration and a quorum of the new configuration commit to the new configuration. A quorum of the old configuration needs to agree to avoid a split-brain problem, for example, when adding more servers. A quorum of the new configuration needs to agree for progress. We also need to make sure that a quorum of the old configuration confirms first, otherwise a partition could cause a split-brain problem.
If the current leader is part of the old and new configurations, then it can keep being the leader once the new leader is installed. Otherwise an epoch change becomes necessary.
It is critical to make sure that every operation committed once a configuration is installed is acknowledged by a quorum from the new configuration. Otherwise a leader crash can cause committed operations to be lost. It might be simpler to stall the pipeline of request processors when a reconfiguration goes through PrepRequestProcessor. By stalling I mean holding operations until the reconfiguration operation is committed.
This section contains a proposed algorithm and API for reconfiguring zookeeper cluster membership.
Any comments and suggestions are welcome (Alex Shraer, shralex@yahoo-inc.com).
Let M be the current configuration of Zookeeper. We define a configuration to be the set of participating servers, members(M), and a quorum system over these members. In order to reconfigure the system to a new configuration M’ an administrative client submits a reconfig(M’) operation through any member s in M. This causes s to:
The algorithm for leader(M) is as follows:
All servers do the following:
During state discovery in M, if some server responds next(M)!=null, let M’ be such returned non-null configuration. The algorithm (or ZAB) will make sure that at-most a single non-null value is returned.
The leader executes the reconfiguration alg. with the following changes:
The server s that submitted the reconfig operation to the leader should check what's the current configuration after a new leader is established. If it is still the old one it can re-submit the reconfig. If s fails, the client can re-submit the reconfig to another server. Recall that version(M) is included by the client in the request so a server receiving it should first check that the configuration is still M (using local state) before submitting it to the leader, and the leader should also make this check (since it has the latest state).
A choice that has to be made is what kind of operations to support – incremental changes like “add server A” or “remove server B” (e.g., as in survey on reconfiguration of R/W objects , #DynaStore) or full membership specification, as in “reconfigure so that the membership becomes is A, B, C” (e.g., survey on reconfiguration with virtual synchrony , #Rambo).
One notable disadvantage of non-incremental requests is when multiple reconfigurations are proposed concurrently. Suppose that the initial configuration is A, B, C and one process proposes to add a server D whereas another proposes to remove B. If each process has to specify the full new configuration then the first process would propose A, B, C, D whereas the second would propose A, C. One of these would succeed first, suppose A, C. Then the second proposal should be aborted otherwise the resulting configuration would be A, B, C, D, where B appears even though it was already removed.
With the incremental approach this would not happen. Here, a configuration can be viewed as a set of changes. The initial configuration is (+A, +B, +C) then in the scenario above the next configuration is (+A, +B, +C, -B) and last one is (+A, +B, +C, -B, +D). This allows all reconfiguration requests to complete without aborting (a wait-free algorithm), as each change can be applied to the current configuration, whether this configuration remained the same or changed.
The non-incremental approach might be preferable for two main reasons:
To support the non-incremental approach, we propose that the administrative client intending to execute a reconfig, first fetches the current config from the system, and then submits a reconfig request where he fully specifies the requested configuration, its quorum system, and includes the version of the current configuration (version(M)). If the server that gets this request already has some other proposal to reconfigure (next(version(M))!=null) the reconfiguration is aborted and the administrative client should decide whether to retry.
At first stage we propose to use the non-incrememtal API for reconfigurations. In the future we intend to use this non-incremental interface only for changing the quorum system and to add an incremental API for wait-free reconfigurations.
Suppose that a reconfig request was issued and the leader started sending phase-1 messages to the current configuration M, but failed after sending to only one other server A. Then, when the recovery happened, the new leader did not see a message from A. Should we allow the reconfiguration request to surface at a later time ? If not, a possible solution might be to have a command “next(M) = null” as the first one issued by any elected leader. If ZAB is used for sending the message in Phase 1, explicitly making sure that there are no incomplete reconfigurations that can surface later might be unnecessary.
The idea of an “off-line” strategy for reconfiguration (survey on reconfiguration with with virtual synchrony , survey on reconfiguring state-machine replication ) is to stop operations in the old configuration, transfer the state to the new configuration and then enable operations – in the new configuration. In contrast, an online reconfiguration approach (#RAMBO, #DynaStore) never stops the service while reconfiguring.
One of the complexities arising in the online approach is that a normal operation can be executing concurrently with a reconfiguration, however the state still must be transferred correctly to the next configuration. The easy case is when the operation occurs in the old configuration and the reconfiguration transfers the state. It is possible, however, that the reconfiguration misses the operation when it transfers the state and completes. In this case, existing online reconfiguration solutions (#RAMBO, #DynaStore) continue the operation and execute it in the new configuration.
In Zookeeper, we must be careful not to violate the global primary order property - a situation where a new primary commits operations and only then the old primary's commit arrives is not allowed in ZAB. The algorithm avoids that by having leader(M') commit operations proposed by leader(M) during the reconfiguration,
and it does so before committing any other operations that it itself proposed in M'.
We should be able to bootstrap ZOOKEEPER using reconfig API instead of configuration files as it is currently done.
Should be similar to the one in survey on reconfiguration with with virtual synchrony and #DynaStore.
We should probably have a DNS-like solution as a fall-back for clients that were not around during the reconfiguration (suppose that all of the servers
that the client knew are already down). For normal operation we should also have a push-base solution to notify clients about the configuration change.
Some preliminary thoughts:
For servers in M', should the server periodically broadcast the new configuration (or its id), to try to makes sure that all servers in M' know about M' ?
Surveys:
1. Ken Birman, Dahlia Malkhi, and Robbert Van Renesse, Virtually Synchronous Methodology for Dynamic Service Replication, no. MSR-TR-2010-151, November 2010 paper
2. Leslie Lamport, Dahlia Malkhi and Lidong Zhou, Reconfiguring a State Machine. In SIGACT News 41(1), SIGACT News 41(1): 63-73 (2010) paper
3. Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, Alexander Shraer:
Reconfiguring Replicated Atomic Storage: A Tutorial. In the Bulletin of the European Association for Theoretical Computer Science 102, pages 84-108, Distributed Computing Column, October 2010. paper
Rambo (R/W objects, fully specifies configurations, online, uses external consensus on configurations order):
4. Nancy Lynch and Alex Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In 5th International Symposium on Distributed Computing (DISC), 2002. paper
DynaStore (R/W objects, incremental reconfiguration, online, doesn’t require consensus):
5. Marcos Aguilera, Idit Keidar, Dahlia Malkhi, and Alex Shraer, Dynamic Atomic Storage Without Consensus, in Journal of the ACM, ACM, 2011 paper