This page describes the proposal to include Observers in ZooKeeper. See https://issues.apache.org/jira/browse/ZOOKEEPER-368
Observers are a type of participant in a ZooKeeper ensemble that do not take part in the underlying atomic broadcast protocol that ZK is built upon. In particular, they do not vote on new proposals and can not become Leaders.
Observers are informed of all committed proposals in zxid order. Observers may issue proposals, and will be told about the outcome.
The main advantage of having Observers is to remove some of the burden of serving clients from ensemble nodes which are more concerned with correctly executing instances of ZAB in a timely fashion. Observers are not a critical component of a ZooKeeper ensemble, and therefore do not have to meet strict timing guarantees. They can therefore be more heavily loaded without fear of jeopardising the liveness of the ensemble.
This design provides read-scalability; i.e. the ability to add more read-only clients without compromising write performance. Because Observers do not vote or participate in leader elections the only impact of a careful implementation on ensemble performance should be the bookkeeping required to keep track of them at the Leader, plus the messaging costs associated with informing them of committed proposals.
In the current design, where clients can only connect to Followers, attaching too many clients can slow down a Follower or even crash it. The behaviour of clients is such that, like a swarm of insects, they will move to the next Follower which may well suffer the same fate. The performance of the cluster, and eventually its liveness, would be compromised. Observers can insulate the core ensemble against these problems.
- As a datacenter bridge: Forming a ZK ensemble between two datacenters is a problematic endeavour as the high variance in latency between the datacenters could lead to false positive failure detection and partitioning. However if the ensemble runs entirely in one datacenter, and the second datacenter runs only Observers, partitions aren't problematic as the ensemble remains connected. Clients of the Observers may still see and issue proposals.
- As a link to a message bus: Some companies have expressed an interest in using ZK as a component of a persistent reliable message bus. Observers would give a natural integration point for this work: a plug-in mechanism could be used to attach the stream of proposals an Observer sees to a publish-subscribe system, again without loading the core ensemble.
- To support dynamic membership: once dynamic membership is enabled it will be common for nodes that intend to become Followers to connect to the ensemble in order to be ready to be informed of the membership change that admits them. Observers naturally capture the state of being connected but not being part of the voting ensemble.
An Observer is started much like any other ensemble peer. They find and connect to the Leader through the same mechanism as Followers. Instead of sending a FOLLOWERINFO packet, they send an OBSERVERINFO packet which tells the Leader that this is an Observer. The Leader from that point on can distinguish the Observer and doesn't send it any Follower-specific messages.
The Leader and the Observer sync up through the usual mechanism. The Leader then only sends INFORM messages to the Observer which are sent when a proposal has received enough votes to be committed. If the Leader receives an ACK from an Observer, or if the Observer receives a PROPOSAL from a Leader, it's an error.
Clients may connect to the Observer as if it were a Follower, and issue proposals, set watches and so on. The Observer forwards REQUEST packets to the Leader, and commits the resulting proposals once INFORM is received. Observers make the same guarantees about ordering of client proposals that Followers do. Observers will eventually see the whole proposal log due to syncing with the Leader upon every connection attempt.
The fact that INFORM is the only message sent to an Observer for a proposal means that the message cost of Observers is one message per Observer per proposal, compared to three per Follower. However, this means that INFORM must contain enough information to commit each proposal (unlike COMMIT messages which just include a zxid and are matched against the already seen PROPOSAL); so the saving is essentially the ACK / COMMIT message pair.
It is very important not to require a complete cluster restart for these changes, and to maintain backwards compatibility with existing data.
There are two cases to consider:
- An Observer tries to connect to a pre-Observer cluster: The Observer will succeed in connecting, but once it sends an OBSERVERINFO packet the Leader will respond that it does not understand the packet type and close the connection.
- A pre-Observer Follower tries to connect to an Observer-aware cluster: The behaviour of Followers has not been changed. They still receive the same set of messages, and connect via the same protocols, as before. Therefore they will be able to successfully connect to an upgraded cluster.
The data format has not changed, so logs will be backwards compatible. Downgrading will also be possible to a pre-Observer version.
Current security in ZK is achieved in two ways: ACLs on individual znodes that enforce policies at the client, and a whitelist of ensemble nodes in the configuration file.
Since Observers currently can connect from any source address, this removes all security from the cluster. We must, at least, implement a whitelist of IP ranges from which an Observer can connect. At this point security is the same as the current version - to attack a ZK cluster we must write a compromised Follower and run it from one of the IP addresses in the configuration file.
We should look into authenticated connections, and if we implement the re-publish use case must take sure to only re-publish proposals which are authorised by an ACL provided to the Observer.
We must test the claims made about backwards compatibility. Ideally we would have a running cluster and a suite of tests that can be run against it as we do a rolling upgrade of the ensemble nodes. This would be a helpful general tool for any major release.
Observers should be subject to the same test suite that Followers are (including stress tests etc).
We should also test the basic properties of Observers: never receiving a proposal, not voting to elect a leader (can maybe do this by constructing a situation where a leader would be elected iff the Observer voted), seeing proposals in zxid order etc.
The ensemble performance should remain reasonably unchanged. There are two areas where extra load is placed on the system:
The Leader must keep track of all Observers. The cost is the same as for each Follower - in fact in the current implementation FollowerHandlers are used to handle Observer connections.
- Observers must receive every proposal via a single INFORM message, adding N messages to every proposal for N Observers.
Regression testing of existing benchmarks will help validate these claims.
To add Observers to your cluster you must edit the configuration file for every machine.
Configuration for nodes that are to be Observers:
Add the line
(you may also specify peerType=participant, which is the default)
Configuration for all nodes (including Observers):
All nodes that are Observers must have :observer appended to their server config line. For example:
sets server 1 to be an Observer.
Note that currently you must specify an election port in order to denote a server as an Observer in the configuration file; this port is optional otherwise. This restriction will be removed.
All servers need to know which nodes are observers in order to construct quorums correctly. Observers do not vote, and therefore should not be considered part of any quorum.
With these configuration changes in place, you should be able to start your cluster as normal, and connect to Observer nodes as though they were Followers.
Observers duplicate a great deal of functionality from Followers. Therefore I have introduced a Peer base class that contains common code. There is an accompanying PeerZooKeeperServer class from which ObserverZooKeeperServer and FollowerZooKeeperServer inherit.
Observers currently have the same RequestProcessor pipeline as Followers. This might not be the case in the future.
Some of the code in the current patch is preparation for the dynamic membership patch which is built upon this one. (Some of these changes need erasing for the final version of this patch). There is PeerType enum which describes whether a Peer is a PARTICIPANT (i.e. will follow if able to) or if it is an OBSERVER. Based on that the LeaderElection process knows which state to move the Peer into once it has found a leader; from LOOKING to either FOLLOWING or OBSERVING.
Setting peerType=observer in a server's configuration file will ensure that its PeerType=OBSERVER. Otherwise it defaults to PARTICIPANT.
What's left to do?
- Different timeout for Observers - can be much more lax about timeouts.
- Evaluate impact on JMX
- Update four-letter commands
- Final version of patch that cleans up rough edges, provides documentation and test cases.