New SolrCloud Design

(Work in progress)

What is SolrCloud?

SolrCloud is an enhancement to the existing Solr to manage and operate Solr as a search service in a cloud.


Guiding Principles


A ZooKeeper cluster is used as:


The cluster is configured with a fixed max_hash_value (which is set to a fairly large value, say 1000) ‘N’. Each document’s hash is calculated as:

hash = hash_function(doc.getKey()) % N

Ranges of hash values are assigned to partitions and stored in Zookeeper. For example we may have a range to partition mapping as follows

range  : partition
------  ----------
0 - 99 : 1
100-199: 2
200-299: 3

The hash is added as an indexed field in the doc and it is immutable. This may also be used during an index split

The hash function is pluggable. It can accept a document and return a consistent & positive integer hash value. The system provides a default hash function which uses the content of a configured, required & immutable field (default is unique_key field) to calculate hash values.

Shard Assignment

The node -> partition mapping can only be changed by a node which has acquired the Cluster Lock in ZooKeeper. So when a node comes up, it first attempts to acquire the cluster lock, waits for it to be acquired and then identifies the partition to which it can subscribe to.

Node to a shard assignment

The node which is trying to find a new node should acquire the cluster lock first. First of all the leader is identified for the shard. Out of the all the available nodes, the node with the least number of shards is selected. If there is a tie, the node which is a leader to the least number of shard is chosen. If there is a tie, a random node is chosen.

Boot Strapping

Cluster Startup

A node is started pointing to a Zookeeper host and port. The first node in the cluster may be started with cluster configuration properties and the schema/config files for the cluster. The first node would upload the configuration into zookeeper and bootstrap the cluster. The cluster is deemed to be in the “bootstrap” state. In this state, the node -> partition mapping is not computed and the cluster does not accept any read/write requests except for clusteradmin commands.

After the initial set of nodes in the cluster have started up, a clusteradmin command (TBD description) is issued by the administrator. This command accepts an integer “partitions” parameter and it performs the following steps:

  1. Acquire the Cluster Lock
  2. Allocate the “partitions” number of partitions
  3. Acquires nodes for each partition
  4. Updates the node -> partition mapping in ZooKeeper

  5. Release the Cluster Lock
  6. Informs all nodes to force update their own node -> partition mapping from ZooKeeper

Node Startup

The node upon startup, checks ZooKeeper if it is a part of existing shard(s). If ZooKeeper has no record of the node or if the node is not part of any shards, it follows the steps in the New Node section else it follows the steps in the Node Restart section.

New Node

A new node is one which has never been part of the cluster and is newly added to increase the capacity of the cluster.

If the “auto_add_new_nodes” cluster property is false, the new nodes register themselves in ZooKeeper as “idle” and wait until another node asks them to participate. Otherwise, they proceed as follows:

  1. The Cluster Lock is acquired
  2. A suitable source (node, partition) tuple is chosen:
    1. The list of available partitions are scanned to find partitions which has less then “replication_factor” number of nodes. In case of tie, the partition with the least number of nodes is selected. In case of another tie, a random partition is chosen.
    2. If all partitions have enough replicas, the nodes are scanned to find one which has most number of partitions. In case of tie, of all the partitions in such nodes, the one which has the most number of documents is chosen. In case of tie, a random partition on a random node is chosen.
    3. If moving the chosen (node, partition) tuple to the current node will decrease the maximum number of partition:node ratio of the cluster, the chosen tuple is returned.Otherwise, no (node, partition) is chosen and the algorithm terminates
    4. The node -> partition mapping is updated in ZooKeeper

  3. The node status in ZooKeeper is updated to “recovery” state

  4. The Cluster Lock is released
  5. A “recovery” is initiated against the leader of the chosen partition
  6. After the recovery is complete, the Cluster Lock is acquired again
  7. The source (node, partition) is removed from the node -> partition map in ZooKeeper

  8. The Cluster Lock is released
  9. The source node is instructed to force refresh the node -> partition map from ZooKeeper

  10. Goto step #1