Motivation
Before getting into the issues present in the current design of the Mesos Master, we'll first look at some motivating design principles.
Simplicity is a fundamental design principle for building a fault-tolerant, large-scale, maintainable system. Accordingly, we have rejected complex solutions in favor of simpler ones whenever the simpler solution is pragmatically considered "good enough".
The decision to have a Master without persistent state was an intentional design decision based on simplicity. However, it came at the cost of allowing inconsistent task state in very rare situations that required a confluence of failures across Masters and Slaves. Frameworks such as Aurora had ways of working around these issues, which is why persistence has not been added to the Master; the stateless Master was "good enough" to warrant simplicity's substantial benefits.
Let's look at how the Master works in the current design. During the Master's normal operation, when a slave process fails, the Master detects the socket closure / health check failures and notifies the framework about the lost slave and that some of the framework's tasks may have been lost (see the diagram below).
This works well because:
- The Master is long-running; restarts typically occur only due to manual intervention or machine / network failures.
- A new Master is elected in < 10 seconds. (We use a 10 second ZooKeeper session timeout).
- Slaves recover from failures; the scenario above requires a Slave's permanent failure, which is rare in practice.
However, Master failovers cause consistency issues due to the lack of persistence. Consider the same scenario as above, but with a Master failover:
Updates were never sent to the framework for the lost tasks. It would seem possible for the scheduler to reconcile this state against the Master, by asking "What is the status of my tasks?". The Master could reply with the appropriate task lost messages, since the task is unknown to it. However, this is trickier than it seems at first glance: what if the Master answers this question after a failover, and it hasn't yet learned of all the tasks? There's also the following consistency issue, in the presence of network partitions:
In this case, there was a network partition between the Master and the Slave, resulting in the Master notifying the Framework that the tasks and slave were lost. During the partition, the Master fails over to a new Master, which knows nothing of this partitioned Slave. Consequently, when the partition is restored and the Slave attempts to re-register, the new Master allows the re-registration, despite the framework already being told that the Slave was lost!
These sorts of inconsistencies are the issue at hand, and these can be solved through the Registrar: adding a minimal amount of persistent state to the Master.
Design
The proposed solution to guarantee eventual consistency (more on this later) for slave information is to introduce a minimal amount of persistent, replicated state in the Master. This state will initially be the set of slaves managed by the Master. This state will be authoritative: if a Slave is not present in it, that Slave cannot re-register with the Master. This state will be referred to as the "Registry". We'll also introduce a component in the Master responsible for managing the Registry, called the "Registrar". The initial Registry will only store Slave information; in the future we will introduce additional state into the Registry.
The Registrar will be an asynchronous libprocess Process (think: Actor) inside the Master process. The Registrar will use the State abstraction to manage the Registry, where the State implementation can vary between ZooKeeper, the Replicated Log, and LevelDB (see the section below for more details).
Now, the Master will first consult the Registrar when dealing with registration, re-registration, and removal of slaves. When the Registrar replies to the Master, the modified state has been persisted to the Registry and the Master can proceed with the remainder of the operation. This applies the same principle as write-ahead logging for ensuring atomicity and durability.
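To make the write-ahead ordering concrete, below is a minimal C++ sketch. The Registrar class and its admit() / readmit() / remove() methods are hypothetical names chosen for illustration, and std::future stands in for the libprocess futures the real implementation would use:

#include <future>
#include <iostream>
#include <set>
#include <string>

// Stand-in for the persisted Registry: the set of admitted slave IDs.
struct Registry {
  std::set<std::string> slaves;
};

class Registrar {
public:
  // Persist the admission of a slave; completes only once durable.
  std::future<bool> admit(const std::string& slaveId) {
    return std::async(std::launch::deferred, [this, slaveId] {
      registry.slaves.insert(slaveId);  // In reality: a versioned State write.
      return true;
    });
  }

  // A slave may only re-register if it is still present in the Registry.
  std::future<bool> readmit(const std::string& slaveId) {
    return std::async(std::launch::deferred, [this, slaveId] {
      return registry.slaves.count(slaveId) > 0;
    });
  }

  // Persist the removal; once complete, the slave can never re-register.
  std::future<bool> remove(const std::string& slaveId) {
    return std::async(std::launch::deferred, [this, slaveId] {
      return registry.slaves.erase(slaveId) > 0;
    });
  }

private:
  Registry registry;  // The real Registrar persists this via the State API.
};

int main() {
  Registrar registrar;

  // Registration: the Master acknowledges the slave only after the write.
  if (registrar.admit("slave-1").get()) {
    std::cout << "slave-1 registered" << std::endl;
  }

  // Removal: frameworks are told slaveLost() only after the write.
  registrar.remove("slave-1").get();

  // A partitioned slave attempting to re-register after removal is refused.
  if (!registrar.readmit("slave-1").get()) {
    std::cout << "slave-1 re-registration refused" << std::endl;
  }
}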
Consider normal operation of the Registrar, with respect to the lifecycle of a slave:
Note that we only register the slave once it is persisted in the Registry; after this point, the slave will be allowed to re-register:
Again, note that the slave is only re-registered once the master confirms it is present in the Registry. Consider slave removal, when a slave is no longer reachable:
Note the write-ahead nature of the Registry: the framework is only informed once we've removed the slave from the Registry. This signal is not delivered reliably. That is, if the Master fails after persisting the removal but before notifying frameworks, the signal is dropped. Likewise, if a framework fails before receiving the signal, it is dropped. This is dealt with through a reconciliation mechanism against the Master (see Consistency below).
To observe how inconsistencies are prevented, consider the partition scenario from the Motivation above:
We can see that once the slave is removed from the Registry, it will never be allowed to re-register. This is key to correctness: once decided, we must not go back on our decision to remove the slave, and so the slave must not be allowed to re-register.
The following registry format is proposed, and may be amended:
// The Registry is a single state variable, to enforce atomic versioned writes.
message Registry {
  // Leading master. This ensures the Registry version changes when an election occurs.
  required Master master = 1;

  // All admitted slaves.
  required Slaves slaves = 2;

  // Other information in the future.
}

// This wrapper message is used to ensure we can store additional information
// (missing from MasterInfo) if needed.
message Master {
  required MasterInfo master = 1;
}

// This wrapper message is used to ensure we have a single Message to represent slaves,
// in case Slaves will be stored separately in the future.
message Slaves {
  repeated Slave slaves = 1;
}

// This wrapper message is used to ensure we can store additional information (missing
// from SlaveInfo) if needed.
message Slave {
  required SlaveInfo info = 1;
}
Consistency
With the Registrar, frameworks are not guaranteed to receive the lost slave signal (we do not persist the receipt of messages, so frameworks may miss lost slave messages in some circumstances, e.g. when the Master dies immediately after removing a slave from the Registry). As a result, frameworks must do the following to ensure consistent task state:
- Before returning from statusUpdate(), the update must be persisted durably by the framework scheduler.
- Periodically, the scheduler must reconcile the status of its non-terminal tasks with the Master (via reconcileTasks; see the sketch after this list). (It is possible for the scheduler to receive a TASK_LOST after a different terminal update, but never before; such TASK_LOST updates should be ignored.)
- Optionally, when receiving slaveLost(), the scheduler should transition its tasks to TASK_LOST for the corresponding slave. This is an optimization only.
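As an illustration of the reconciliation requirement above, here is a C++ sketch using the scheduler driver's reconcileTasks call. The nonTerminalTasks bookkeeping is a hypothetical structure the scheduler would maintain from its durably persisted status updates:

#include <map>
#include <string>
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

using namespace mesos;

// Ask the Master for the current status of every non-terminal task. The
// Master replies via statusUpdate(); tasks unknown to the Master come back
// as TASK_LOST, which lets the scheduler converge on consistent state.
void reconcile(SchedulerDriver* driver,
               const std::map<std::string, TaskState>& nonTerminalTasks) {
  std::vector<TaskStatus> statuses;

  for (const auto& task : nonTerminalTasks) {
    TaskStatus status;
    status.mutable_task_id()->set_value(task.first);  // Task ID.
    status.set_state(task.second);  // The scheduler's believed state.
    statuses.push_back(status);
  }

  // Invoked periodically, and especially after a Master failover.
  driver->reconcileTasks(statuses);
}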
Later, these may be done automatically for schedulers via a persistent scheduler driver. Also, the periodic nature of reconciliation is unnecessary; it is only required after a Master failover. However, the initial implementation will require periodic reconciliation. More thorough documentation around reconciliation will be added separately from this document.
Performance
Performance testing will be carried out to ensure the Registrar is efficient and able to scale to over 10,000 slaves.
There are two key aspects of this design that ensure the Registrar can scale:
1. The Master and Registrar interact asynchronously. The Master does not block waiting for the Registrar, and vice versa.
2. The Registrar employs a batch processing technique that ensures at most 1 write operation is queued at any time. While a state operation is in effect, all incoming requests are batched together into a single mutation of the state, and applied as a single subsequent write.
(1) alone would not be sufficient to ensure a performant Registrar, as the number of queued operations could grow without bound. However, with (1) and (2), the Registrar is largely immune to performance issues in the underlying state implementation: with a fast state implementation we'll perform more writes of smaller mutations to state, and with a slow state implementation we'll perform fewer writes with larger mutations to the state. The sketch below illustrates this batching behavior.
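The following self-contained C++ sketch illustrates the batching technique described above. The Batcher class and its method names are hypothetical, and the "write" is simulated synchronously where the real Registrar would issue an asynchronous State write:

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// The persisted state: here just a list of slave names for illustration.
struct Registry {
  std::vector<std::string> slaves;
};

class Batcher {
public:
  // Queue a mutation; if no write is in flight, start one immediately.
  void apply(std::function<void(Registry*)> mutation) {
    pending.push_back(std::move(mutation));
    if (!writing) {
      flush();
    }
  }

  // Called when the in-flight write completes. Everything that arrived
  // in the meantime is applied as one single subsequent write.
  void onWriteComplete() {
    writing = false;
    if (!pending.empty()) {
      flush();
    }
  }

private:
  // Apply ALL queued mutations, then issue a single write; at most one
  // write is ever outstanding, so the queue cannot grow without bound.
  void flush() {
    for (const auto& mutation : pending) {
      mutation(&registry);
    }
    std::cout << "writing " << pending.size() << " batched mutation(s)"
              << std::endl;  // In reality: an asynchronous State write.
    pending.clear();
    writing = true;
  }

  Registry registry;
  std::vector<std::function<void(Registry*)>> pending;
  bool writing = false;
};

int main() {
  Batcher batcher;

  // The first mutation starts a write immediately (a batch of 1).
  batcher.apply([](Registry* r) { r->slaves.push_back("slave-1"); });

  // These arrive while that write is in flight, so they are queued...
  batcher.apply([](Registry* r) { r->slaves.push_back("slave-2"); });
  batcher.apply([](Registry* r) { r->slaves.push_back("slave-3"); });

  // ...and are applied together as one write when the first completes.
  batcher.onWriteComplete();  // -> "writing 2 batched mutation(s)"
  batcher.onWriteComplete();  // Nothing pending; no further write.
}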
State Implementation
ZooKeeper
Mesos currently supports ZooKeeper as a storage layer for State. However, there are two issues with the use of ZooKeeper in the Registrar:
- ZooKeeper's default znode size limit is just under 1MB (see jute.maxbuffer). ZooKeeper was also "designed to store data on the order of kilobytes in size." The initial design for the Registrar will treat slaves as a single large versioned blob, which can easily exceed 1MB for thousands of slaves.
- Because we treat slaves as a single versioned blob, it is undesirable to split our data across znodes: this would add complexity, and we would lose our atomic writes unless we also implemented transactional writes in our ZooKeeper storage layer.
Therefore, although support will be provided, ZooKeeper should only be used for small Mesos clusters (< 100 slaves) whose slaves do not carry large amounts of meta-data (e.g. SlaveInfo attributes).
Replicated Log
Mesos will soon support the Replicated Log as a state storage backend.
- The Replicated Log has an atomic append operation for blobs of unbounded size.
- When using the Replicated Log, Mesos must run in high availability mode (multiple masters running as hot spares), so that it has a quorum of online log replicas.
- The Replicated Log has a safety constraint that at most one log co-ordinator has write access to the log. The safety implications of this are discussed below.
LevelDB
It is common when first using Mesos to run in a non-HA mode, with a single Master, and potentially no ZooKeeper cluster available. To support this type of user, LevelDB can be used as the state storage backend. This means that moving a Master to another machine will require moving the Master's work directory (containing LevelDB state) to the new machine.
Safety and State Initialization
The first safety concern is guaranteeing that only the leading Master can make a successful update to the Registry. The following properties help provide this guarantee:
1. The State abstraction uses a versioning technique to deny concurrent writes to the same variable. We will store the Registry in a single State Variable, which enforces that it is consistently versioned. This means that if a rogue Master attempts a write to the Registry after the leading Master has performed a successful write, the rogue Master's write will fail. However, there is a race: the rogue Master can write first and cause the leading Master's write to fail.
2. To deal with the race in (1), when a leading Master recovers state from the Registrar, the Registry will be updated with the leader's MasterInfo. This ensures that the version of the Registry changes as a result of an election, thus preventing a rogue Master from performing any subsequent writes (see the sketch after this list).
3. In addition to the above two safety measures, if the Replicated Log is used, it is impossible for the rogue Master to perform a write: the rogue log co-ordinator will have lost its leadership when the new leading Master performs a read of the state. This prevents any writes by the rogue Master.
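To illustrate how versioned writes deny the rogue Master, here is a small C++ sketch of the check-and-set behavior; the Variable and State types are simplified stand-ins for the State abstraction:

#include <cstdint>
#include <iostream>
#include <string>

// A single versioned variable, as in the State abstraction.
struct Variable {
  uint64_t version = 0;
  std::string value;  // The serialized Registry blob.
};

class State {
public:
  Variable fetch() const { return current; }

  // Check-and-set: the store succeeds only if the write is based on the
  // version currently persisted; a writer holding a stale version fails.
  bool store(const Variable& var) {
    if (var.version != current.version) {
      return false;  // Denied: a newer write has already landed.
    }
    current = var;
    current.version++;
    return true;
  }

private:
  Variable current;
};

int main() {
  State state;

  Variable leader = state.fetch();  // The newly elected Master reads...
  Variable rogue = state.fetch();   // ...and so does a demoted rogue Master.

  // On recovery the leader writes its MasterInfo (per (2) above), which
  // bumps the Registry version.
  leader.value = "registry + new leading MasterInfo";
  std::cout << "leader write ok: " << state.store(leader) << std::endl;  // 1

  // Any subsequent rogue write is now based on a stale version and fails.
  rogue.value = "rogue mutation";
  std::cout << "rogue write ok: " << state.store(rogue) << std::endl;    // 0
}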
Related to upgrades and initialization, there are a few tradeoffs between safe operation and ease of use that are important to outline here. Let's begin first with some of the questions that should be asked:
What occurs when state is empty? With a simple implementation, this will appear as an empty cluster. Most of the time, this will be because the cluster is indeed empty, or being started for the first time. However, consider an accidental mis-configuration of the Master that pointed a production cluster to empty state, or less likely, the loss of all log replicas or ZooKeeper data loss. In these cases, it is not desirable to proceed happily assuming there are no slaves in the cluster (every slave will be refused re-admission to the Registry).
How does one upgrade? If one did so naively using a simple implementation, this would also appear as an empty cluster, resulting in a restart of all of the slaves (they will be refused re-admission into the Registry).
From these questions one can derive the following important requirements:
1. Safe upgrade functionality must be provided. This is essential for those upgrading a running cluster.
2. For production users, we must provide the ability to require an initialization of the Registry, allowing differentiation between a first run and misconfiguration / data loss.
It is interesting to consider satisfying (2) through (1): if the state is empty, enter a temporary upgrade period in which all slaves are permitted to re-register. This would keep a cluster operational; however, it would expose the inconsistencies described in the Motivation. The implicit nature of this solution is also undesirable (operators may want to be alerted and intervene to recover data when this occurs).
Proposed Initialization and Upgrade Scheme
To determine whether the Registry state is initialized, we will persist an 'initialized' variable in the Registry. Two new flags provide upgrade semantics and state initialization:
--registrar_upgrade=bool (default: false)
- When true, the Registrar auto-initializes state and permits all re-registrations of Slaves. This lets the Registry be bootstrapped with a running cluster's Slaves.

--registrar_strict=bool (default: false)
- Only interpreted when not upgrading (--registrar_upgrade=false). When true (strict), the Registrar fails to recover when the state is empty. When false (non-strict), the Registrar auto-initializes state when the state is empty.
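The flag matrix above can be summarized by the following small C++ sketch of the recovery decision; the function and its parameters are hypothetical, but the logic mirrors the flag semantics described above:

#include <iostream>

// Returns true if the Registrar should proceed with recovery, possibly
// auto-initializing empty state; false means recovery fails (strict mode).
bool shouldRecover(bool stateEmpty, bool upgrade, bool strict) {
  if (upgrade) {
    return true;  // --registrar_upgrade: bootstrap, admit all re-registrations.
  }
  if (!stateEmpty) {
    return true;  // Normal recovery from existing Registry state.
  }
  // Empty state while not upgrading: strict mode treats this as an error
  // (misconfiguration or data loss); non-strict auto-initializes.
  return !strict;
}

int main() {
  std::cout << shouldRecover(true, false, true) << std::endl;   // 0: strict, empty.
  std::cout << shouldRecover(true, false, false) << std::endl;  // 1: auto-init.
  std::cout << shouldRecover(true, true, false) << std::endl;   // 1: upgrading.
}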
It is important to consider the upgrade semantics in the future as we add additional information to the Registry, where we will only want to bootstrap a portion of it. This could be done by updating the --registrar_upgrade flag to take the name(s) of the registry entries to bootstrap.
Let's look at how the various use cases map to these flags below.
Upgrade
When upgrading from pre-Registrar Mesos to Registrar Mesos:
- Stop all Masters, and upgrade the binaries.
- Start all Masters with --registrar_upgrade.
- After all Slaves have re-registered with the Master, stop all Masters.
- Start all Masters without --registrar_upgrade, optionally setting --registrar_strict if desired.
Starting a Cluster for the First Time
When starting a cluster for the first time:
- Start all Masters without --registrar_strict.
- Optionally, stop all Masters and restart them with --registrar_strict.