This page is for design notes and information relating to operations affecting token/range ownership. See also:
- Offsetting ownership ratios for heterogeneous nodes
- Correcting imbalances created by random token selection
- Randomizing ranges after a migration
When running a cluster of heterogeneous nodes (i.e. nodes with differing amounts of storage, memory, cores, etc.), it may be desirable to place a greater or lesser portion of the keyspace on one or more nodes.
By default, node tokens are randomly generated with the expectation that an even distribution of the token space will result. However, variations of as much as 7% have been reported on small clusters when using the num_tokens default of 256.
These randomly generated tokens are MD5 sums, so entropy isn't the problem here, at least not in the sense that a better RNG would create a more even distribution of ranges. Increasing the token count (either by raising num_tokens, or by adding nodes) will improve this: the more tokens, the more the distribution evens out.
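The effect of token count on balance can be sketched with a quick simulation. This is an illustrative sketch, not part of any Cassandra tooling; the ring size, node count, and seed are assumptions:

```python
import random

RING_MAX = 2**127  # token-space size, as for Cassandra's RandomPartitioner

def ownership_spread(num_nodes, num_tokens, seed=42):
    """Randomly place num_tokens tokens per node on the ring and return
    the min and max fraction of the ring owned by any single node."""
    rng = random.Random(seed)
    tokens = sorted((rng.randrange(RING_MAX), node)
                    for node in range(num_nodes)
                    for _ in range(num_tokens))
    owned = [0] * num_nodes
    # Each token owns the range from the previous token (wrapping) up to itself.
    for (tok, node), (prev_tok, _) in zip(tokens, [tokens[-1]] + tokens[:-1]):
        owned[node] += (tok - prev_tok) % RING_MAX
    fracs = [o / RING_MAX for o in owned]
    return min(fracs), max(fracs)

# More tokens per node narrows the gap between the smallest and
# largest ownership fraction:
lo1, hi1 = ownership_spread(num_nodes=6, num_tokens=1)
lo256, hi256 = ownership_spread(num_nodes=6, num_tokens=256)
```

Running this with different token counts shows the spread shrinking as the total token count grows, which is the "more tokens, more even" effect described above.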
This anecdotal worst-case is probably Good Enough, especially when considering that key distribution is subject to the same properties, or that many data sets are skewed on their own (i.e. optimal ownership is not necessarily optimal anyway).
That said, our history is one where random token selection produced completely unacceptable results, and manual intervention was required. The typical (expected) result of manual token selection is near perfect balance of ownership, and it will likely be some time before people are comfortable seeing otherwise.
When migrating a legacy cluster with one token per node to virtual nodes, the existing range is carved up into num_tokens new ranges. These new ranges are still contiguous, however, so a means of randomizing their placement is needed.
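The carving step can be sketched as follows; the function name and toy ring size are illustrative, not an actual migration interface:

```python
def carve(start_token, end_token, num_tokens, ring_max=2**127):
    """Split a legacy node's single contiguous range (start_token, end_token]
    into num_tokens equal subranges by inserting evenly spaced tokens.
    The resulting vnode ranges are still contiguous on the ring, which is
    why a later randomization/shuffle step is needed."""
    width = (end_token - start_token) % ring_max
    return [(start_token + (i + 1) * width // num_tokens) % ring_max
            for i in range(num_tokens)]

# A (0, 100] range carved into 4 vnode tokens on a toy 0..1000 ring:
carve(0, 100, 4, ring_max=1000)  # → [25, 50, 75, 100]
```

The modulo handles ranges that wrap around the top of the ring, e.g. a range starting at 900 on a 1000-token ring.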
In the most basic sense, balanced means that each node owns 1/n of the token-space, so adjusting ownership for heterogeneous nodes is implicitly about unbalancing. This matters because, if for example you reduced a node's ownership to (1/n)*0.8, you would expect that imbalance to persist, and not be balanced away by operations on other nodes.
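One way to make the arithmetic concrete is to express the intended imbalance as per-node weights and derive the persistent ownership fractions from them. This is a sketch under that assumption, not a mechanism the notes prescribe:

```python
def ownership_fractions(weights):
    """Expected persistent ownership when imbalance is expressed as a
    per-node weight: a node weighted 0.8 in an otherwise uniform
    5-node cluster keeps 0.8/4.8 of the ring (slightly more than
    (1/5)*0.8, because the freed share is absorbed by the other nodes)."""
    total = sum(weights)
    return [w / total for w in weights]

fracs = ownership_fractions([0.8, 1, 1, 1, 1])
# fracs[0] is 0.8/4.8 ≈ 0.167; the other four nodes each own 1/4.8
```

Whichever convention is chosen, the key property from the text holds: the fractions are stable targets, so later operations on other nodes do not silently balance the imbalance away.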
"Shuffling [one] node at a time means that for each node i for i in 0..N-1 (where N is the cluster size), i/N of the ranges shuffled will, on average, have been shuffled at least once already. So it's substantially less efficient than shuffling once, then assigning the vnodes out in one cluster-wide pass." -- Jonathan Ellis
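The inefficiency in that quote can be quantified with a small sketch (not part of any actual tool):

```python
def expected_redundant_fraction(n):
    """Shuffling one node at a time: when node i is processed, i/n of the
    ranges it shuffles have, on average, been shuffled before. Averaging
    over all nodes gives (n-1)/(2n), approaching 1/2 as the cluster grows.
    A single cluster-wide pass has no redundant moves."""
    return sum(i / n for i in range(n)) / n

expected_redundant_fraction(4)  # → 0.375: over a third of the work is repeated
```

So on any non-trivial cluster, roughly half the per-node shuffle work is wasted motion compared to one cluster-wide pass.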
Shuffling will entail moving a lot of data around the cluster, so it has the potential to consume significant disk and network I/O, and to take a considerable amount of time. For this to be an online operation, the shuffle will need to run at a lower priority than other streaming operations, and should be expected to take days or weeks to complete.
- Corollary: shuffling should tell the operator what vnodes it plans to move where, and report progress whenever one completes successfully. This will allow recovering from an interrupted shuffle, if necessary.
- Shuffling can be sped up by parallelizing, such that each node has at most one vnode moving to or from it at a time. With appropriate stream throttling, this should be better than moving just one vnode at a time cluster-wide.
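The recorded, resumable plan described in the corollary could look something like the sketch below. The file format and function names are invented for illustration and are not an actual Cassandra interface:

```python
import json

def make_shuffle_plan(relocations, path="shuffle_plan.json"):
    """Persist the planned (vnode_token, source, destination) moves so an
    operator can see what will move where, and so an interrupted shuffle
    can be resumed from the last completed relocation."""
    with open(path, "w") as f:
        json.dump([{"token": t, "from": src, "to": dst, "done": False}
                   for t, src, dst in relocations], f)

def mark_done(token, path="shuffle_plan.json"):
    """Record the successful completion of one relocation (the progress
    reporting described above)."""
    with open(path) as f:
        plan = json.load(f)
    for entry in plan:
        if entry["token"] == token:
            entry["done"] = True
    with open(path, "w") as f:
        json.dump(plan, f)
```

On restart, the shuffle would simply skip every entry already marked done, giving the recovery property the corollary asks for.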
Nodes / Cluster
The most straightforward method of changing ownership is a token move (i.e. relocating a range from one node to another). Exposing this via JMX would allow all of the required operations to be implemented client-side.
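What a token move does to ownership can be modeled on a toy ring. This models only the bookkeeping, not JMX or data streaming, and the names are illustrative:

```python
def ownership(ring, ring_max=2**127):
    """Compute each node's ring fraction from a {token: node} map.
    A token owns the range from the previous token (wrapping) up to itself."""
    toks = sorted(ring)
    owned = {}
    for prev, tok in zip([toks[-1]] + toks[:-1], toks):
        node = ring[tok]
        owned[node] = owned.get(node, 0) + ((tok - prev) % ring_max) / ring_max
    return owned

def move_token(ring, token, new_node):
    """A token move as described above: reassign one vnode's range from
    its current owner to new_node."""
    ring = dict(ring)
    ring[token] = new_node
    return ring

ring = {0: "b", 25: "a", 50: "b", 75: "a"}      # toy 0..100 ring
ownership(ring, ring_max=100)                    # a and b each own 0.5
ownership(move_token(ring, 25, "b"), ring_max=100)  # now b owns 0.75
```

Composing such moves is enough to implement all three use cases listed at the top of the page client-side, which is the motivation for exposing the primitive over JMX.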