When a ZooKeeper server loses contact with over half of the other servers in an ensemble ("loses a quorum"), it stops responding to client requests. For some applications it would be beneficial if a server still responded to read requests when the quorum is lost, but raised an error when a write request was attempted.
This project will implement a "read-only" mode for ZooKeeper servers that allows read requests to be served as long as the client can contact a server.
To enable read-only functionality, the user should pass an optional boolean parameter to the ZooKeeper constructor. When a client with r-o mode enabled is connected to a server that is part of the majority, it behaves like a normal client. But if the server is partitioned, read requests issued by such a client are allowed, while write requests fail with an exception.
If r-o mode is disabled for a client, it won't connect to any server while there is no quorum.
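The connection rules above can be sketched as a small decision function. This is an illustrative model only: the names `SessionMode` and `admit` are hypothetical, not part of the actual ZooKeeper API (in the client API the choice is just the boolean constructor parameter).

```java
// Illustrative model of the connection rules described above; the
// names (SessionMode, admit, ...) are hypothetical, not ZooKeeper API.
public class AdmissionSketch {
    enum SessionMode { READ_WRITE, READ_ONLY, REJECTED }

    // Decide how a connecting client is handled, given whether the
    // client enabled r-o mode and whether the server still has a quorum.
    static SessionMode admit(boolean clientReadOnlyEnabled, boolean serverHasQuorum) {
        if (serverHasQuorum) {
            return SessionMode.READ_WRITE;  // normal session, reads and writes
        }
        return clientReadOnlyEnabled
                ? SessionMode.READ_ONLY     // reads allowed, writes fail
                : SessionMode.REJECTED;     // current behavior: drop connection
    }

    public static void main(String[] args) {
        assert admit(true, true) == SessionMode.READ_WRITE;
        assert admit(true, false) == SessionMode.READ_ONLY;
        assert admit(false, false) == SessionMode.REJECTED;
        System.out.println("ok");
    }
}
```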
Session states. A new session state is introduced: CONNECTEDREADONLY. A client moves to this state when it is connected to a partitioned server (which implies that only clients with r-o mode enabled can be in this state). In this state only read requests may be issued. From this state the session can transition to CONNECTED -- if the client reconnects to an r/w server -- and to all states reachable from the CONNECTED state.
Session events. How will the application know the mode has changed? The default watcher of an r-o client will inform the application about mode changes from normal to read-only and vice versa, in addition to existing notifications such as connection loss.
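A sketch of how an application might track the current mode from its default watcher. `KeeperState` here is a simplified local stand-in for the client library's state enum, and `ModeTracker` is a hypothetical helper, not part of ZooKeeper itself:

```java
// Sketch of tracking the session mode from the default watcher.
// KeeperState is a simplified stand-in for the client's state enum;
// ModeTracker is a hypothetical helper, not a ZooKeeper class.
public class ModeTracker {
    enum KeeperState { CONNECTED, CONNECTED_READ_ONLY, DISCONNECTED }

    private volatile boolean readOnly = false;

    // Called from the default watcher on every state-change event.
    void process(KeeperState state) {
        switch (state) {
            case CONNECTED_READ_ONLY:
                readOnly = true;   // on a partitioned server: reads only
                break;
            case CONNECTED:
                readOnly = false;  // back on an r/w server
                break;
            default:
                break;             // connection loss etc. handled elsewhere
        }
    }

    boolean isReadOnly() { return readOnly; }

    public static void main(String[] args) {
        ModeTracker t = new ModeTracker();
        t.process(KeeperState.CONNECTED_READ_ONLY);
        assert t.isReadOnly();
        t.process(KeeperState.CONNECTED);
        assert !t.isReadOnly();
        System.out.println("ok");
    }
}
```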
Special case of state transitions. If the very first server the client connects to is a partitioned server, the client receives a "fake" session id from it (fake because the majority doesn't know about this session). When such a client eventually connects to an r/w server, it receives a valid session id. All this happens transparently to users. They should just be aware that the sessionId stored in the ZooKeeper object can change (only if it is a fake sessionId; a valid sessionId will never change).
Watches set in r-o mode. A client can safely set watches in r-o mode; the only caveat is that they will not fire until the client reconnects to an r/w server. So if a client connects to a partitioned server, sets a watch for data changes on node /a, and meanwhile /a's data is changed by the majority servers, the watch fires when the client reconnects to a majority server.
To let the server distinguish these two types of clients, an "am-i-readonly-client" field is added to the packet the client sends to the server during the connection handshake. If a server in r-o mode receives a connection request from a non-r-o client, it rejects the client. This is the only protocol change, so the traffic overhead is the size of one boolean per session.
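The stated overhead is easy to check: a serialized boolean adds one byte to the connect packet. The layout below is illustrative (it is not the real wire-encoded ConnectRequest), only the size of the extra field matters:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Demonstrates the stated overhead: serializing one extra boolean
// ("am-i-readonly-client") costs a single byte per connect packet.
// The field layout here is illustrative, not the real ConnectRequest encoding.
public class HandshakeOverhead {
    static byte[] connectPacket(long sessionId, boolean readOnly) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeLong(sessionId);    // pre-existing field (example)
            out.writeBoolean(readOnly);  // the single new field: one byte
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for in-memory streams
        }
    }

    public static void main(String[] args) {
        int size = connectPacket(0L, true).length;
        assert size == Long.BYTES + 1; // 8 bytes + 1 extra byte
        System.out.println(size);
    }
}
```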
This will involve changes in both Java and C clients.
Server-side activity in r-o mode is handled by a subclass of ZooKeeperServer, ReadOnlyZooKeeperServer. Its chain of request processors is similar to the leader's chain, but at the beginning it has a ReadOnlyRequestProcessor which passes read operations through but throws an exception on state-changing operations.
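The filtering step of that first processor can be sketched as follows. The op names and the exception are illustrative stand-ins, not the actual server classes:

```java
// Sketch of the ReadOnlyRequestProcessor's filtering logic: read
// operations pass through, state-changing ones are rejected. Op names
// and the exception type are illustrative, not the real server classes.
public class ReadOnlyFilterSketch {
    enum Op { GET_DATA, GET_CHILDREN, EXISTS, CREATE, SET_DATA, DELETE }

    static class NotReadOnlyException extends RuntimeException {
        NotReadOnlyException(Op op) {
            super("write rejected in read-only mode: " + op);
        }
    }

    static boolean isReadOp(Op op) {
        return op == Op.GET_DATA || op == Op.GET_CHILDREN || op == Op.EXISTS;
    }

    // First processor in the chain: pass reads, fail writes.
    static void processRequest(Op op) {
        if (!isReadOp(op)) {
            throw new NotReadOnlyException(op);
        }
        // ...otherwise hand the request to the next processor in the chain
    }

    public static void main(String[] args) {
        processRequest(Op.GET_DATA); // passes through
        boolean rejected = false;
        try {
            processRequest(Op.SET_DATA);
        } catch (NotReadOnlyException e) {
            rejected = true;
        }
        assert rejected;
        System.out.println("ok");
    }
}
```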
When a server, namely its QuorumPeer, loses the quorum, it destroys whichever ZooKeeperServer was running and goes to the LOOKING state (this is current logic which doesn't need to be altered), and creates a ReadOnlyZooKeeperServer (new logic). Then, when a client connects to this peer, if the running server sees it is a read-only client, it starts handling its requests; if it is a usual client, the server drops the connection, as it does currently.
When the peer reconnects to the majority, it acts as it does for other state changes: it shuts down the ZooKeeper server (which notifies all read-only clients of the state change) and switches to the new state.
Recovering from partitioning
Server side. When a server regains a quorum, all currently connected r-o clients are notified about this (see above).
Client side. A read-only client keeps looking for an r/w server: it goes through the list of available servers and pings them with the newly added "isro" four-letter command. If the answer is "rw", meaning the server is r/w, the client reconnects to it. This polling backs off exponentially, i.e. the timeout between successive calls doubles after each call (with an upper bound of 60 seconds), so that if no r/w server is available, clients don't overload the live ones with frequent, unnecessary requests.
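The backoff schedule can be sketched as a pure function. The 60-second cap comes from the design above; the initial delay and the names are assumptions for illustration:

```java
// Sketch of the exponential backoff between "isro" probes: the delay
// doubles after each attempt, capped at 60 seconds. The initial delay
// and the class/method names are illustrative assumptions.
public class IsroBackoff {
    static final long INITIAL_MS = 1_000;  // assumed starting delay
    static final long MAX_MS = 60_000;     // upper bound from the design

    static long nextDelay(long currentMs) {
        return Math.min(currentMs * 2, MAX_MS);
    }

    public static void main(String[] args) {
        long d = INITIAL_MS;
        for (int i = 0; i < 10; i++) {
            // here the client would probe each server with "isro"
            // and reconnect as soon as one answers "rw"
            d = nextDelay(d);
        }
        assert d == MAX_MS; // the delay saturates at 60 s
        System.out.println(d);
    }
}
```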
New server / old server:
The server-server protocol remains untouched. When a new server is in the LOOKING state, it also runs a ReadOnlyZooKeeperServer, but this server doesn't interact with other peers, only with r-o clients. New and old servers are safe to run together.
New client / old server:
A new client sends the new "am-i-read-only" field during the connection handshake, along with the other information. An old server simply ignores this field, as it doesn't know about it. From the new client's point of view, an old server is just a usual r/w server, as it never accepts a connection while partitioned. In conclusion: new clients are safe to run against old servers, but obviously r-o mode will not be available in that case, as it isn't implemented in old servers.
Old client / new server:
Since a new server expects the "am-i-read-only" info during the connection handshake and an old client doesn't send it, the new server treats old clients as r/w clients. This means that when a new server becomes partitioned, it rejects connections from such clients. So old clients will be handled correctly by new servers, without any inconsistencies in session handling or other aspects.
Application developers will decide which client to use -- r-o enabled or not -- i.e. they will choose between a guaranteed consistent view of the system and accepting an occasionally outdated view in return for read access.
Enabling read-only mode in current applications will involve changing their session-handling logic (they will have to handle the new "mode changed" notifications), but since the interface remains unchanged, the transition should be very smooth.
It's worth repeating: although the server side will change significantly, the behavior for usual clients remains the same, so no backwards incompatibility issues are introduced.
Benefits of this design: transparent use of the new client, backwards compatibility.
Drawbacks: more coupling between server and client (but this seems unavoidable in any case).
Jira ticket: ZOOKEEPER-704
My GSoC proposal