
...

Before getting into issues with the Mesos Master's current design, let's look at some of its motivating design principles and decisions.

Simplicity is a fundamental design principle, particularly when designing a fault-tolerant, large-scale, maintainable system. Accordingly, when a simpler solution was pragmatically "good enough", we rejected more complex solutions.

In particular, having a Master without persistent state was an intentional design decision based on simplicity. However, this came at the cost of allowing inconsistent task state in very rare situations involving a confluence of failures across Masters and Slaves. Frameworks such as Aurora had ways of working around these issues, which is why persistence was not added to the Master; the stateless Master was "good enough" given the substantial benefits of simplicity.

...

  • The Master is long-running; restarts are typically due to manual intervention or machine/network failures.
     
  • A new Master is elected in < 10 seconds. (We use a 10 second ZooKeeper session timeout).
     
  • Slaves recover from failures; the above scenario requires a Slave's permanent failure, which is rare in practice.

However, Master failovers cause consistency issues due to the lack of persistence. Consider the same scenario, but with a Master failover instead of a Slave failure:

Gliffy Diagram: slave_failure_master_failover

Updates for the lost tasks were never sent to the framework. At first glance, it would seem possible for the scheduler to fully reconcile this state against the Master by asking "What is the status of my tasks?"; since the tasks are unknown to the Master, it could reply with task lost messages. However, this is trickier than it seems: what if the Master has just recovered from a failover, and hasn't yet learned of all the tasks? There's also the following consistency issue in the presence of network partitions:

...

These inconsistencies can be solved through the Registrar: the addition of a minimal amount of persistent state to the Master.

...

Due to the Registry's write-ahead nature, frameworks are only informed once we've removed the Slave from the Registry. This signal is not delivered reliably: if the Master fails after persisting the removal but before notifying frameworks, the signal is dropped. Likewise, if a framework fails before receiving the signal, it is dropped. We deal with this through a reconciliation mechanism against the Master (see Consistency below).
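
For illustration, the lost Slave signal could be a message along the lines of the sketch below (the message name and field here are assumptions for this sketch, not a statement of the actual wire format); the key point is that it is sent only after the Slave's removal has been persisted in the Registry:

// Illustrative sketch only: a "Slave lost" notification delivered to frameworks
// strictly after the Slave's removal has been written to the Registry.
message LostSlaveMessage {
  // Identifies the Slave that was removed from the Registry.
  required SlaveID slave_id = 1;
}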

...

// The Registry is a single state variable, to enforce atomic versioned writes.
message Registry {
  // Leading master. This ensures the Registry version changes when an election occurs.
  required Master master = 1;

  // All admitted slaves.
  required Slaves slaves = 2;

  // Other information in the future.
}

// This wrapper message ensures we have a single Message to represent the Master,
// in case the Master is stored separately in the future.
message Master {
  required MasterInfo master = 1;
}

// This wrapper message ensures we have a single Message to represent Slaves,
// in case Slaves are stored separately in the future.
message Slaves {
  repeated Slave slaves = 1;
}

// This wrapper message ensures we can store additional information (missing
// from SlaveInfo) if needed.
message Slave {
  required SlaveInfo info = 1;
}

Consistency

Use of the Registrar does not guarantee that frameworks receive the lost Slave signal: we do not persist message receipt, so frameworks may not receive lost Slave messages in some circumstances (for example, if the Master dies immediately after removing a Slave from the Registry). As a result, frameworks must do the following to ensure consistent task state (a sketch of such a reconciliation request follows this list):

...
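
Reconciliation here amounts to the scheduler asking the Master, "What is the status of my tasks?". As a rough sketch only (the message and field names below are assumptions, not necessarily the actual Mesos protocol), such a request could look like:

// Illustrative sketch only: a reconciliation request from a framework asking
// the Master to re-send the latest status for the listed tasks. For tasks the
// Master does not know about, it would reply with a "task lost" status update.
message ReconcileTasksMessage {
  // The framework requesting reconciliation.
  required FrameworkID framework_id = 1;

  // Last known statuses of the tasks to reconcile.
  repeated TaskStatus statuses = 2;
}

As noted earlier, a Master that has just failed over may not yet know about all tasks, so any such reconciliation has to account for that window.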

We will test the Registrar for efficiency and the ability to scale to tens of thousands of Slaves. With respect to scalability, this design has two key aspects:

...

  1. ZooKeeper's default znode size limit is just under 1MB (see jute.maxbuffer). Since ZooKeeper was designed to store data on the order of kilobytes in size, raising this limit is considered unsafe. The Registrar's initial design treats Slaves as a large versioned blob, which can easily exceed 1MB for thousands of Slaves (see the rough estimate after this list).
  2. Because we treat Slaves as a single versioned blob, it is undesirable to split our data across znodes. This would add complexity, and we would lose atomic writes unless we also implemented transactional writes in our ZooKeeper storage layer.
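
As a rough, assumption-laden estimate of why the blob grows large: if each serialized Slave entry (SlaveInfo plus wrapper overhead: hostname, resources, attributes) averages on the order of 150 bytes, then

  10,000 Slaves × ~150 bytes/Slave ≈ 1.5 MB

which already exceeds the default znode limit. Actual sizes depend heavily on resources and attributes, so this is only an illustration.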

...

  1. The State abstraction uses a versioning technique to deny concurrent writes to the same variables (a sketch of this versioning scheme follows this list). We will store the Registry in a single State Variable, which enforces consistent versioning. This means that if a rogue Master attempts a write to the Registry after the leading Master has performed a successful write, the rogue Master fails. However, there is a race: the rogue Master can write first and cause the leading Master to fail.
  2. To deal with this race, when a leading Master recovers state from the Registrar, the Registry is updated with the leader's MasterInfo. This ensures the Registry version updates as a result of an election, thus preventing a rogue Master from performing any writes.
  3. In addition to these two safety measures, if the replicated log is used, the rogue Master cannot perform a write, as the rogue log coordinator loses its leadership when the new leading Master performs a read of the state. This helps prevent any writes by the rogue Master.
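
As an illustration of the versioning in items 1 and 2 (a sketch under assumptions, not the State abstraction's actual storage format), the Registry could be stored as a single versioned entry, where a write is accepted only if it carries the version the writer last read:

// Illustrative sketch only: a versioned entry as the State abstraction might
// store it. A write is accepted only if the submitted version matches the
// stored version, and a successful write increments it. A rogue Master
// writing with a stale version therefore fails; and because the newly
// elected leader re-writes the Registry (updating MasterInfo) during
// recovery, the version moves past anything the rogue Master last read.
message Entry {
  // Name of the state variable, e.g. "registry".
  required string name = 1;

  // Version used for check-and-set; incremented on every successful write.
  required uint64 version = 2;

  // The serialized Registry message.
  required bytes value = 3;
}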

...