Cluster Membership: ZOOKEEPER-107

1. Motivation

ZooKeeper clusters are currently static: the set of servers participating in a cluster is statically defined in a configuration file. In many instances, however, it is desirable to be able to add and remove servers from an ensemble. The difficulty in implementing such a feature is ensuring that a change in the configuration does not cause inconsistencies in a ZooKeeper cluster. A related issue is enabling a client to learn the current ensemble of servers dynamically.

2. Requirements

  1. The ensemble of servers must agree upon the current configuration, and to reach agreement, Zab sounds like the obvious choice;
  2. We need new client calls to add and remove servers. It is unclear whether we want one call for each modification or one call to propose a whole new configuration (both styles are sketched just after this list);
  3. It must work with both majority and flexible quorums;
  4. We need a mechanism, perhaps based on URIs, to enable a client to learn the current configuration.
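
To make requirement 2 concrete, here is a minimal sketch of what the two candidate call styles might look like. These method names and the membership-string format are hypothetical; nothing like them exists in the ZooKeeper client API today.

import java.util.List;
import org.apache.zookeeper.KeeperException;

// Hypothetical client-side interface illustrating the two API styles
// from requirement 2; all names here are placeholders.
public interface ReconfigClient {
    // Style A: one call per incremental modification.
    void addServer(String serverSpec) throws KeeperException, InterruptedException;
    void removeServer(long serverId) throws KeeperException, InterruptedException;

    // Style B: one call that proposes a complete new membership,
    // e.g. ["server.1=host1:2888:3888", "server.2=host2:2888:3888", ...].
    void reconfigure(List<String> newMembership)
            throws KeeperException, InterruptedException;
}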

3. Some pre-design random thoughts

When moving from one configuration to another, we need to make sure that a quorum of the old configuration and a quorum of the new configuration commit to the new configuration. A quorum of the old configuration needs to agree to avoid a split-brain problem, for example, when adding more servers. A quorum of the new configuration needs to agree for progress. We also need to make sure that a quorum of the old configuration confirms first, otherwise a partition could cause a split-brain problem.
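
A minimal sketch of this activation rule follows, using ZooKeeper's existing QuorumVerifier abstraction (which covers both majority and flexible quorums; its exact method signature varies across versions). The surrounding class and the ack bookkeeping are assumptions for illustration, not actual server code.

import java.util.Set;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;

// Sketch: a reconfiguration may be activated only once a quorum of the
// old configuration has confirmed (avoiding split brain) and a quorum of
// the new configuration has confirmed (guaranteeing progress afterwards).
class ReconfigCommitRule {
    private final QuorumVerifier oldConfig;
    private final QuorumVerifier newConfig;

    ReconfigCommitRule(QuorumVerifier oldConfig, QuorumVerifier newConfig) {
        this.oldConfig = oldConfig;
        this.newConfig = newConfig;
    }

    // oldAcks/newAcks hold the ids of servers that acknowledged the proposal.
    boolean canActivate(Set<Long> oldAcks, Set<Long> newAcks) {
        return oldConfig.containsQuorum(oldAcks)
            && newConfig.containsQuorum(newAcks);
    }
}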

...

It is critical to ensure that every operation committed after a new configuration is installed is acknowledged by a quorum of the new configuration. Otherwise a leader crash can cause committed operations to be lost. It might be simpler to stall the pipeline of request processors when a reconfiguration goes through PrepRequestProcessor. By stalling we mean holding operations until the reconfiguration operation is committed.
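
A minimal sketch of such stalling, under the assumption of a single processor stage handing requests to the next stage; the class and method names are illustrative, not the actual request-processor code.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: while a reconfiguration is in flight, queue incoming requests
// instead of forwarding them; release the queue once the reconfig commits.
class StallingProcessor {
    private final Queue<Object> held = new ArrayDeque<>();
    private boolean reconfigInFlight = false;

    synchronized void process(Object request, boolean isReconfig) {
        if (isReconfig) {
            reconfigInFlight = true;   // start holding subsequent requests
        }
        if (reconfigInFlight && !isReconfig) {
            held.add(request);         // stall until the reconfig commits
            return;
        }
        forward(request);
    }

    synchronized void onReconfigCommitted() {
        reconfigInFlight = false;
        while (!held.isEmpty()) {
            forward(held.poll());      // drain in arrival order
        }
    }

    private void forward(Object request) {
        // Stand-in for handing off to the next processor in the chain.
    }
}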

4. Proposed Algorithm

This section contains a proposed algorithm and API for reconfiguring ZooKeeper cluster membership.
Any comments and suggestions are welcome (Alex Shraer, shralex@yahoo-inc.com).

...

  • upon receipt of <retire, version(M)>
    • garbage-collect M

4.1 Notes

  • State transfer starts when members(M’) connect to leader(M), before reconfig(M’) is invoked. However, some suffix of operations might not have been transferred to M’ before the reconfiguration began. Since no new operations can be committed in M’ before this state transfer completes, we’d like to minimize this tail. A possible way to do that is to require, as a prerequisite (line 1), that at most w operations scheduled before the reconfiguration are unknown to the connected quorum of M’; a sketch of such a check follows this list.
  • When line 8 completes we know that operations scheduled before the reconfiguration are committed in M’.
  • Even if the current leader remains the leader of M’, we cannot allow operations to be executed in M’ before phase 1 ends; otherwise, if the leader fails, we have a split brain (some operations execute in M’, and then when a new leader recovers, new operations will be executed in M).
  • Instead of lines (2b) and (3b), the leader of M can redirect further operations to leader(M’) (whether leader(M’) equals leader(M) or not). Leader(M’) will buffer them until M’ is activated.
  • If phase 1 is done using a normal ZAB proposal, it might not be necessary to explicitly ensure that no incomplete reconfiguration requests remain in the system after a recovery.
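
Here is a sketch of the line-1 prerequisite check from the first note, treating zxids as simple counters for illustration; the class, the per-server zxid map, and the bound w are assumptions, and QuorumVerifier's signature varies across ZooKeeper versions.

import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;

// Sketch: before starting the reconfiguration, verify that a quorum of M'
// is connected and lags the leader's log by at most w operations.
class StateTransferGate {
    static boolean quorumCaughtUp(QuorumVerifier newConfig,
                                  Map<Long, Long> lastZxidByServer,
                                  long leaderZxid,
                                  long w) {
        // Servers within w operations of the leader count as caught up.
        Set<Long> caughtUp = lastZxidByServer.entrySet().stream()
                .filter(e -> leaderZxid - e.getValue() <= w)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        return newConfig.containsQuorum(caughtUp);
    }
}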

4.2 Recovery from leader failure

During state discovery in M, if some server responds with next(M) != null, let M’ be the returned non-null configuration. The algorithm (or ZAB) will make sure that at most a single non-null value is returned.
The leader executes the reconfiguration algorithm with the following changes:

  • Instead of line 1, try to connect to M’ and transfer state, so that a quorum of M’ is connected and up to date. If unsuccessful, skip the reconfiguration.
  • If a quorum of M’ indicates that they have already activated M’, skip to phase 3.
  • Otherwise, if next(M) = M’ was returned by a quorum of M, skip to phase 2.
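
The decision logic above can be summarized in a small sketch; the enum and the boolean flags are illustrative placeholders, not actual server code.

// Sketch: after leader election in M, the new leader inspects the next(M)
// values returned during discovery and decides where to resume.
enum ResumePoint { NO_RECONFIG, PHASE_1, PHASE_2, PHASE_3 }

class RecoveryDecision {
    static ResumePoint decide(boolean someServerReturnedNext,
                              boolean stateTransferSucceeded,
                              boolean quorumOfNewActivated,
                              boolean quorumOfOldReturnedNext) {
        if (!someServerReturnedNext) {
            return ResumePoint.NO_RECONFIG;  // no reconfiguration in flight
        }
        if (!stateTransferSucceeded) {
            return ResumePoint.NO_RECONFIG;  // skip the reconfiguration
        }
        if (quorumOfNewActivated) {
            return ResumePoint.PHASE_3;      // M' was already activated
        }
        if (quorumOfOldReturnedNext) {
            return ResumePoint.PHASE_2;      // phase 1 already completed
        }
        return ResumePoint.PHASE_1;
    }
}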

Some design choices

4.3 Reconfiguration API

A choice that has to be made is what kind of operations to support: incremental changes like "add server A" or "remove server B" (e.g., as in the survey on reconfiguration of R/W objects [ASSurvey], #DynaStore), or full membership specification, as in "reconfigure so that the membership becomes A, B, C" (e.g., the survey on reconfiguration with virtual synchrony [VSSurvey], #RAMBO).

...

As a first stage, we propose to use the non-incremental API for reconfigurations. In the future, we intend to use this non-incremental interface only for changing the quorum system and to add an incremental API for wait-free reconfigurations.
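
To illustrate how an incremental call could later be layered on top of the non-incremental primitive, here is a small sketch; the class, method, and server-spec format are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch: "add server" is computed client-side as (current membership +
// new server) and submitted as one full-membership reconfiguration, so
// the full specification remains the base primitive.
class IncrementalOverFull {
    static List<String> withServerAdded(List<String> current, String newServerSpec) {
        List<String> next = new ArrayList<>(current);
        next.add(newServerSpec);   // e.g. "server.4=host4:2888:3888"
        return next;               // submitted via the non-incremental call
    }
}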

4.4 Old reconfiguration requests

Suppose that a reconfig request was issued and the leader started sending phase-1 messages to the current configuration M, but failed after sending to only one other server A. Then, during recovery, the new leader did not see a message from A. Should we allow the reconfiguration request to surface at a later time? If not, a possible solution might be to have the command "next(M) = null" be the first one issued by any elected leader. If ZAB is used for sending the message in phase 1, explicitly making sure that there are no incomplete reconfigurations that can surface later might be unnecessary.
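
A sketch of that safeguard follows, with illustrative types; in reality the command would go through ZAB's ordered broadcast rather than a local assignment.

// Sketch: the first command of a newly elected leader clears any pending
// next configuration, so a half-proposed reconfiguration (seen by a single
// server before the old leader crashed) cannot resurface later.
class LeaderStartup {
    // Placeholder for a proposed configuration; null means none pending.
    static final class Config {
        final String spec;
        Config(String spec) { this.spec = spec; }
    }

    private volatile Config next;  // next(M)

    void onElected() {
        broadcastSetNext(null);    // "next(M) = null" as the first command
    }

    private void broadcastSetNext(Config c) {
        // Stand-in for a ZAB proposal; here we just apply it locally.
        next = c;
    }
}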

4.5 Online vs. offline reconfiguration

The idea of an "offline" strategy for reconfiguration (the surveys on reconfiguration with virtual synchrony [VSSurvey] and on reconfiguring state-machine replication [SMRSurvey]) is to stop operations in the old configuration, transfer the state to the new configuration, and then enable operations in the new configuration. In contrast, an online reconfiguration approach (#RAMBO, #DynaStore) never stops the service while reconfiguring.
One of the complexities arising in the online approach is that a normal operation can execute concurrently with a reconfiguration; the state must nevertheless be transferred correctly to the next configuration. The easy case is when the operation occurs in the old configuration and the reconfiguration transfers the state. It is possible, however, that the reconfiguration misses the operation when it transfers the state and completes. In this case, existing online reconfiguration solutions (#RAMBO, #DynaStore) continue the operation and execute it in the new configuration.
Unfortunately, this may violate the global primary order in ZooKeeper: operations issued in the new configuration (potentially by a different primary) may have already completed, in which case global primary order does not allow operations issued by an old primary to be applied.
We therefore choose the offline reconfiguration strategy; however, we try to minimize the period of unavailability by pre-transferring the state to the new configuration before the reconfig begins.

4.6 Bibliography

Surveys:

  1. [VSSurvey] Ken Birman, Dahlia Malkhi, and Robbert Van Renesse. Virtually Synchronous Methodology for Dynamic Service Replication. Microsoft Research Technical Report MSR-TR-2010-151, November 2010.
  2. [SMRSurvey] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Reconfiguring a State Machine. SIGACT News 41(1): 63-73 (2010).
  3. [ASSurvey] Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, and Alexander Shraer. Reconfiguring Replicated Atomic Storage: A Tutorial. Bulletin of the EATCS 102: 84-108, Distributed Computing Column, October 2010.

...