Mounting a Remote Zookeeper

Overview

Our goal is to enable write throughput scalability through partitions in ZooKeeper. The main use case is applications that access some portion of the namespace frequently, whereas the accesses to other areas are infrequent (primarily for inter-application synchronization). We preserve the normal ZK semantics, including the regular consistency guarantees. (The challenge is that sequentially consistent runs are non-composable).

A client maintains a single connection to one ZK server, serving the most-frequently-accessed partition (we call it home partition). The performance of read and write accesses to the home partition remains unaffected in most cases (see the details below), whereas accesses to remote partitions might suffer from lower latency/throughput.

Our design provides a way for incorporating a part of one ZK namespace in another ZK namespace, similarly to mount in Linux. Each ZK cluster has its own leader, and intra-cluster requests are handled exactly as currently in ZK. Inter-cluster requests are coordinated by the local leader. We strive to minimize the inter-cluster communication required to achieve the consistency guarantees.

Background

Namespace partitioning for throughput increase has been partially addressed in ZK before. The observer architecture allows read-only access to part of the namespace. The synchronization with the partition's quorum happens in the background. Since the remote partition is read-only, updates to it can be scheduled at arbitrary timing, and the consistency is preserved. The downside is restricted access semantics. Our solution can reuse the synchronization building block implemented for observers.

The design in http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper discusses how to detect the need for partition automatically. It does not deal with the question of reconciling multiple partitions (containers). Hence, this discussion is largely orthogonal to our proposal.

Several Distributed Shared Memory (DSM) systems proposed in 1990's implemented sequential consistency, but hit performance bottlenecks around false sharing. ZK is a different case, because the namespace is structured as a filesystem, which prevents falsely shared accesses.

We expect the remote partition mounting to be advantageous for workloads with a highly skewed access pattern: frequent R/W access to one partition, and infrequent R/W access to other partition(s). The advantages of our proposal are as follows:

  1. Transparency to clients - there are no semantic differences in accessing multiple partitions.
  2. Minimum (if any impact) on R/W operations in the home partition - see details below.

API/system management modifications

Our proposal is to provide mount and unmount commands. Mount gets the following parameters:

We expect a DNS-like service to exist that maps a ZK cluster identifier (e.g., a virtual IP address) to a list of nodes comprising that cluster. Using this list, a local leader can connect to the remote leader as an observer for the mounted subtree. It is possible to designate a special path for mounts, for example /zookeeper/mnt. Unmount simply gets a path in the local ZK tree, e.g., /zookeeper/mnt/europe1. The mount and unmount commands are executed by (privileged) clients and affect all clients connected to the local ZK cluster. For now, we don't allow chaining of mounts (i.e., a mount must point to a subtree hosted by the second partition, not a mount point itself).

The current ZK design provides a prefix property - if a client operation fails, than the subsequent operations will fail as well. (The FIFO property is regular ZK semantics). We provide a parameter whether to preserve this property in the multi-partition setting:

Architecture

The proposed architecture is as follows. Each ZK cluster has its own leader. In addition, each cluster has an observer process for each remote partition mounted to the local namespace. The observer may replicate only the mounted part of the remote tree, not necessarily the whole remote tree. The leader is responsible for keeping track of these observers: it creates a new observer process when a new mount is performed or upon observer failure and kills an observer when an unmount is executed. Each ZooKeeper cluster tracks only the clients connected to servers in that cluster. To support ephemeral nodes a cluster can subscribe with another cluster to be notified upon the failure of a particular client.

When receiving a request for a local node, the leader schedules the request in its normal pipeline and executes it with ZAB. When receiving a request for a mounted partition, it then forwards the request to the appropriate observer. When the operation is committed, the observer returns a responce to the leader, which forwards it back to the follower.

The mapping of cluster identifiers to list of servers can be done by an external DNS-like service (which is also used to support membership changes insider a ZK cluster (reconfiguration)).

Changes to semantics

Why is coordination among partitions necessary to preserve current semantics

Sequential consistency, currently preserved by ZooKeeper, requires that clients see a single sequence of operations (no matter what server they are connected to), and this sequence conforms with the local program order at each client. For example, if operation A completes before operation B is invoked by the same client, then A takes affect before B. Sequential consistncy allows local reads since it does not require real-time order among clients -- for example, if A completes on client C1 and then B is invoked on client C2, sequential consistency allows B to be scheduled either before or after A. Unlike stronger properties such as linearizability, sequential consistency is not composable. Therefore, if we take two ZK clusters, each is sequentially consistent individually, but together they are not necessarily sequentially consistent, like in the following example:

Execution of Client C1 in ZK cluster 1: set local node x to 1 returns OK, then get remote node y is invoked and returns 0 Execution of Client C2 in ZK cluster 2: set local node y to 1 returns OK, then get remote node x is invoked and returns 0

Here x is part of the ZK namespace managed in cluster 1, and y is managed in cluster 2. If we look only on the operations involving one of the nodes, it is sequentially consistent, but there is no way to create a single sequence involving all these operations and preserve local program order: the write of 1 to y must appear in the sequence after the read from y which returns 0, but then the read of x appears after the write of 1 to x which is not possible since it returned 0.

Therefore, some coordination is necessary among ZK clusters in order to preserve current consistency semantics.

Rules for inter-cluster communication

Obviously if cluster A only accesses nodes in its own namespace and cluster B only accesses its own nodes, no coordination is necessary -- we can logically create a legal global order on operations in both clusters without any actual inter-cluster communication. On the other hand, in the example above, one of the remote reads must return 1 for the global order to exist, so some coordination is in fact necessary.

We propose to use the following rules:

The idea is to exploit the fact that "mount relationship" between the ZK clusters is not necessarily symmetric, and force communication with a remote cluster only when a symmetric interest exists. In order to support this rule, a local write to /a/b/c will notify those observers which are connected to remote partitions interested in /a/b/c, i.e., have mounted local /, /a, /a/b or /a/b/c. During a local write, the leader will perform a lookup in an in-memory data structures pointing to the appropriate observers.

An observer receives such notifications from the leader, and periodically performs a sync with the remote partition. When a remote read request arrives, the observer will only sync if it did not already do so after the last notification.

Note that the observer is mostly up to date as it gets all relevant commits from the remote cluster asynchronously. Thus, even if it is necessary to sync its state to execute a remote read, this update will most likely to be fast essentially flushing the communication chunnel between the observer and a remote leader.