New SolrCloud Design
(Work in progress)
What is SolrCloud?
SolrCloud is an enhancement to the existing Solr that manages and operates Solr as a search service in a cloud.
Glossary
Cluster : A set of Solr nodes managed as a single unit. The entire cluster must have a single schema and solrconfig
Node : A JVM instance running Solr
Partition : A partition is a subset of the entire document collection. A partition is created in such a way that all its documents can be contained in a single index.
Shard : A Partition needs to be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. A node may be a part of multiple shards
Leader : Each Shard has one node identified as its leader. All the writes for documents belonging to a partition should be routed through the leader.
Replication Factor : Minimum number of copies of a document maintained by the cluster
Transaction Log : An append-only log of write operations maintained by each node
Partition version : A counter maintained by the leader of each shard, incremented on each write operation and sent to the peers
Cluster Lock : This is a global lock which must be acquired in order to change the range -> partition or the partition -> node mappings.
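The leader, partition version, and transaction log interact on every write: the leader appends the operation to its log, increments the version, and sends both to its peers. A minimal in-memory sketch of that interaction (class and method names are illustrative, not Solr code):

```python
# Illustrative sketch of a shard leader maintaining a partition version
# and an append-only transaction log. Names are hypothetical, not Solr APIs.

class ShardLeader:
    def __init__(self):
        self.partition_version = 0   # incremented on every write operation
        self.transaction_log = []    # append-only log of write operations

    def write(self, doc):
        """Accept a write, log it, and return the version sent to peers."""
        self.partition_version += 1
        self.transaction_log.append((self.partition_version, doc))
        return self.partition_version

leader = ShardLeader()
v1 = leader.write({"id": "doc1"})
v2 = leader.write({"id": "doc2"})
# versions are strictly increasing, so peers can preserve write order
```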
Guiding Principles
- Any operation can be invoked on any node in the cluster.
- No non-recoverable single points of failure
- Cluster should be elastic
- Writes must never be lost, i.e., durability is guaranteed
- Order of writes should be preserved
- If two clients send document "A" to two different replicas at the same time, one should consistently "win" on all replicas.
- Cluster configuration should be managed centrally and be updatable through any node in the cluster. No per-node configuration should be required other than local values such as the port and the index/log storage locations
- Automatic failover of reads
- Automatic failover of writes
- Automatically honour the replication factor in the event of a node failure
A ZooKeeper cluster is used as:
- The central configuration store for the cluster
- A co-ordinator for operations requiring distributed synchronization
- The system-of-record for cluster topology
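One possible znode layout for these three roles (the paths are illustrative; the actual layout is not specified by this design):

```
/solrcloud
  /config            # schema, solrconfig, cluster properties (central config store)
  /topology
    /nodes           # one ephemeral znode per live node (system-of-record)
    /partitions      # range -> partition and partition -> node mappings
  /locks
    /cluster_lock    # global Cluster Lock (distributed synchronization)
```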
Partitioning
The cluster is configured with a fixed max_hash_value ‘N’, which is set to a fairly large value (say 1000). Each document’s hash is calculated as:
hash = hash_function(doc.getKey()) % N
Ranges of hash values are assigned to partitions and stored in ZooKeeper. For example, we may have a range -> partition mapping as follows:
range     : partition
---------   ---------
0 - 99    : 1
100 - 199 : 2
200 - 299 : 3
The hash is added as an indexed field in the doc and is immutable. It may also be used during an index split.
The hash function is pluggable. It can accept a document and return a consistent & positive integer hash value. The system provides a default hash function which uses the content of a configured, required & immutable field (default is unique_key field) to calculate hash values.
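The hash calculation and the range lookup can be sketched as follows. The use of `zlib.crc32` as the default hash function and the lookup helper are assumptions for illustration; the design only requires a consistent, positive integer hash.

```python
import zlib
import bisect

N = 1000  # max_hash_value, fixed at cluster configuration time

def default_hash(doc, key_field="unique_key"):
    """Default pluggable hash: a consistent, positive integer derived from a
    configured, immutable field. zlib.crc32 stands in for the real function."""
    return zlib.crc32(str(doc[key_field]).encode("utf-8")) % N

# Example range -> partition mapping as stored in ZooKeeper:
# (upper bound of range inclusive, partition id). As in the table above,
# this example only covers hashes 0-299.
RANGES = [(99, 1), (199, 2), (299, 3)]

def partition_for(hash_value):
    """Find the partition whose hash range contains hash_value."""
    uppers = [upper for upper, _ in RANGES]
    i = bisect.bisect_left(uppers, hash_value)
    if i == len(RANGES):
        raise ValueError("hash outside mapped ranges")
    return RANGES[i][1]
```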
The node -> partition mapping can only be changed by a node which has acquired the Cluster Lock in ZooKeeper. So when a node comes up, it first attempts to acquire the cluster lock, waits for it to be acquired, and then identifies the partition to which it can subscribe.
Node to a shard assignment
The node which is trying to find a new node should acquire the cluster lock first. First of all, the leader is identified for the shard. Out of all the available nodes, the node with the least number of shards is selected. If there is a tie, the node which is a leader for the least number of shards is chosen. If there is still a tie, a random node is chosen.
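The tie-breaking rules above can be sketched as follows (the function and parameter names are illustrative, not Solr code):

```python
import random

def assign_node(nodes, shard_count, leader_count):
    """Pick a node for a new shard replica, per the design's tie-breaking:
    fewest shards first, then fewest shards led, then random.
    `shard_count` and `leader_count` map node -> counts."""
    # Prefer nodes hosting the fewest shards.
    fewest = min(shard_count[n] for n in nodes)
    candidates = [n for n in nodes if shard_count[n] == fewest]
    # Tie: prefer nodes leading the fewest shards.
    if len(candidates) > 1:
        fewest_led = min(leader_count[n] for n in candidates)
        candidates = [n for n in candidates if leader_count[n] == fewest_led]
    # Remaining tie: pick at random.
    return random.choice(candidates)
```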
Boot Strapping
A node is started pointing to a ZooKeeper host and port. The first node in the cluster may be started with cluster configuration properties and the schema/config files for the cluster. The first node would upload the configuration into ZooKeeper and bootstrap the cluster. The cluster is deemed to be in the “bootstrap” state. In this state, the node -> partition mapping is not computed and the cluster does not accept any read/write requests except for clusteradmin commands.
After the initial set of nodes in the cluster have started up, a clusteradmin command (TBD description) is issued by the administrator. This command accepts an integer “partitions” parameter and it performs the following steps:
- Acquire the Cluster Lock
- Allocate the “partitions” number of partitions
- Acquire nodes for each partition
- Update the node -> partition mapping in ZooKeeper
- Release the Cluster Lock
- Inform all nodes to force update their own node -> partition mapping from ZooKeeper
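The allocation step might split the hash space [0, max_hash_value) into contiguous ranges, one per partition. A sketch under that assumption (the design does not mandate contiguous or even splits):

```python
def allocate_partitions(max_hash_value, partitions):
    """Split [0, max_hash_value) into `partitions` contiguous ranges,
    as the bootstrap clusteradmin command might do.
    Returns a list of (lo, hi, partition_id) with hi inclusive."""
    size, extra = divmod(max_hash_value, partitions)
    ranges, lo = [], 0
    for p in range(1, partitions + 1):
        # The first `extra` partitions absorb the remainder, one hash each.
        hi = lo + size + (1 if p <= extra else 0) - 1
        ranges.append((lo, hi, p))
        lo = hi + 1
    return ranges
```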
Node startup
Upon startup, the node checks ZooKeeper to see whether it is part of existing shard(s). If ZooKeeper has no record of the node, or if the node is not part of any shards, it follows the steps in the New Node section; otherwise it follows the steps in the Node Restart section.
New Node
A new node is one which has never been part of the cluster and is newly added to increase the capacity of the cluster.
If the “auto_add_new_nodes” cluster property is false, the new nodes register themselves in ZooKeeper as “idle” and wait until another node asks them to participate. Otherwise, they proceed as follows:
- The Cluster Lock is acquired
- A suitable source (node, partition) tuple is chosen:
  - The list of available partitions is scanned to find partitions which have fewer than “replication_factor” nodes. In case of a tie, the partition with the least number of nodes is selected. In case of another tie, a random partition is chosen.
  - If all partitions have enough replicas, the nodes are scanned to find the one which hosts the most partitions. In case of a tie, of all the partitions on such nodes, the one which has the most documents is chosen. In case of a tie, a random partition on a random node is chosen.
  - If moving the chosen (node, partition) tuple to the current node would decrease the cluster’s maximum partitions-per-node ratio, the chosen tuple is returned. Otherwise, no (node, partition) tuple is chosen and the algorithm terminates.
- The node -> partition mapping is updated in ZooKeeper
- The node status in ZooKeeper is updated to the “recovery” state
- The Cluster Lock is released
- A “recovery” is initiated against the leader of the chosen partition
- After the recovery is complete, the Cluster Lock is acquired again
- The source (node, partition) is removed from the node -> partition map in ZooKeeper
- The Cluster Lock is released
- The source node is instructed to force refresh the node -> partition map from ZooKeeper
- Go to step #1
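The source-tuple selection above can be sketched as follows. All names are illustrative, and the final balance check is only noted in a comment; on ties this sketch picks arbitrarily where the design says "random":

```python
import random

def choose_source(partition_nodes, node_docs, replication_factor, current_node):
    """Choose a source (node, partition) for a new node to replicate.
    `partition_nodes` maps partition -> list of hosting nodes;
    `node_docs` maps (node, partition) -> document count."""
    # Step 1: partitions below replication_factor, fewest replicas first.
    under = [p for p, ns in partition_nodes.items() if len(ns) < replication_factor]
    if under:
        fewest = min(len(partition_nodes[p]) for p in under)
        tied = [p for p in under if len(partition_nodes[p]) == fewest]
        p = random.choice(tied)
        # Recovery would run against the leader of this partition; the first
        # hosting node stands in for the leader here.
        return (partition_nodes[p][0], p)
    # Step 2: otherwise find the node hosting the most partitions, and on it
    # the partition with the most documents.
    counts = {}
    for p, ns in partition_nodes.items():
        for n in ns:
            counts[n] = counts.get(n, 0) + 1
    busiest = max(counts, key=counts.get)
    parts = [p for p, ns in partition_nodes.items() if busiest in ns]
    parts.sort(key=lambda p: node_docs.get((busiest, p), 0), reverse=True)
    # Step 3 (omitted): return the tuple only if moving it to current_node
    # lowers the cluster's maximum partitions-per-node ratio.
    return (busiest, parts[0])
```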